$ timeahead_
all sourcesAhead of AI (Sebastian Raschka)Anthropic NewsApple Machine Learning ResearchArs Technica AIAWS Machine Learning BlogCerebras BlogCohere BlogCrewAI BlogDeepSeek BlogDistill.pubfast.ai BlogFireworks AI BlogGoogle AI BlogGoogle Cloud AI BlogGoogle DeepMind BlogGroq BlogHaystack (deepset) BlogHugging Face BlogImport AI (Jack Clark)LangChain BlogLangFuse BlogLil'Log (Lilian Weng)LlamaIndex BlogMeta AI BlogMicrosoft AutoGen BlogMicrosoft Research BlogMistral AI NewsMIT Technology ReviewModal Blogn8n BlogNathan Lambert (RLHF)NVIDIA Developer BlogOllama BlogOpenAI BlogPerplexity AI BlogPyTorch BlogReplicate BlogSimon Willison BlogTensorFlow BlogThe Batch (DeepLearning.AI)The GradientThe Verge AITogether AI BlogVentureBeat AIvLLM BlogWeights & Biases BlogWired AIxAI (Grok) Blog
allapiagentsframeworkshardwareinframodelopen sourcereleaseresearchtutorial
★ TOP STORY[ TVA ]Infra·2d ago

OpenAI says its new GPT-5.5 model is more efficient and better at coding

OpenAI just announced its new GPT-5.5 model, which the company calls its “smartest and most intuitive to use model yet, and the next step toward a new way of getting work done on a computer.” OpenAI just released GPT-5.4 last month, but says that the new GPT-5.5 “excels” at tasks like writing and debugging code, doing research online, making spreadsheets and documents, and doing that work across different tools. OpenAI says its new GPT-5.5 model is more efficient and better at coding The new model ‘excels’ at tasks like writing and debugging code and doing work across different tools. The new model ‘excels’ at tasks like writing and debugging code and doing work across different tools. “Instead of carefully managing every step, you can give GPT-5.5 a messy, multi-part task and trust it to plan, use tools, check its work,…

The Verge AIread →
▲ trending · last 48hview all →
[ATA]Ars Technica AI· 6 articlesvisit →
2d ago
Greenhouse gases from data center boom could outpace entire nations
New gas projects linked to just 11 data center campuses around the US have the potential to create more greenhouse gases than the country of Morocco emitted in 2024. Emissions estimates from air permit documents examined by WIRED show that these natural gas projects—which are being built to power data centers to serve some of the US’s most powerful AI companies, including OpenAI, Meta, Microsoft, and xAI—have the potential to emit more than 129 million tons of greenhouse gases per year. As tech companies race to secure massive power deals to build out hundreds of data centers across the country, these projects represent just the tip of the iceberg when it comes to the potential climate cost of the AI boom. The infrastructure on this list of large natural gas projects reviewed by WIRED is being developed to largely bypass…
2dInfraby Molly Taft, wired.com
4d ago
Pentagon wants $54B for drones, more than most nations’ military budgets
The US military’s massive $1.5 trillion budget request for the next fiscal year includes what Pentagon officials described as the largest investment in drone warfare and counter-drone technology in US history. The proposed spending on drone and autonomous warfare technologies within the FY2027 budget proposal for the US Department of Defense would surpass most countries’ defense budgets and rank among the top 10 in the world for military spending, ahead of countries such as Ukraine, South Korea, and Israel. Specifically, the Pentagon is requesting $53.6 billion to boost US production and procurement of drones, train drone operators, build out a logistics network for sustaining drone deployments, and expand counter-drone systems to defend more US military sites. The funding request is budgeted under the Defense Autonomous Warfare Group (DAWG), an organization established in late 2025 that would see a massive budget…
4dInfra#agentsby Jeremy Hsu
5d ago
Robot runner handily beats humans in half-marathon, setting new record
Humanoid robots outran the fastest human competitors while surpassing the human world record during a half-marathon event held in Beijing on April 19. The demonstration of fast-improving robotic speed and autonomy comes as China’s tech industry is rapidly scaling up mass production of humanoid robots to explore possible uses in the real world. The fastest robot from Chinese smartphone-maker Honor notched a winning time of 50 minutes and 26 seconds while autonomously navigating the 13-mile (21-kilometer) route, according to the Global Times. That beat the human world record of 57 minutes and 20 seconds recently set by Ugandan long-distance runner Jacob Kiplimo during the Lisbon Half Marathon. The winning robot design took inspiration from top human athletes by incorporating long legs measuring approximately 37 inches (95 centimeters) in length, said Du Xiaodi, a test development engineer for Honor, who spoke…
5dInfra#agentsby Jeremy Hsu
9d ago
New Codex features include the ability to use your computer in the background
A new version of OpenAI’s Codex desktop app reaches users today. It brings a smorgasbord of new features and changes, ranging from new developer capabilities to expansion into non-developer knowledge work to laying the groundwork for the company’s “super app.” The most interesting for the moment is the ability to perform tasks on your PC in the background; OpenAI claims it can do this without interfering with what you are doing on your desktop. OpenAI explained the update in a blog post: With background computer use, Codex can now use all of the apps on your computer by seeing, clicking, and typing with its own cursor. Multiple agents can work on your Mac in parallel, without interfering with your own work in other apps. For developers, this is helpful for iterating on frontend changes, testing apps, or working in apps…
9dInfraby Samuel Axon
9d ago
Mozilla launches Thunderbolt AI client with focus on self-hosted infrastructure
Mozilla is the latest legacy tech brand to make a play for the enterprise AI market. But the company behind Firefox and Thunderbird isn’t releasing its own standalone AI model or agentic browser. Instead, the newly announced Thunderbolt is being sold as a front-end client for users and businesses who want to run their own self-hosted AI infrastructure without relying on cloud-based third-party services. Thunderbolt is built on top of Haystack, an existing open source AI framework that lets users build custom, modular AI pipelines from user-chosen components. Thunderbolt acts as what Mozilla calls a “sovereign AI client” on top of that underlying infrastructure. The combo promises to let users easily plug into any ACP-compatible agent or OpenAI-compatible API (including Claude, Codex, OpenClaw, DeepSeek, and OpenCode). The system can also integrate with locally stored enterprise data through open protocols and…
9dInfra#open-sourceby Kyle Orland
10d ago
Allbirds abandons clothes, pivots to "AI compute infrastructure"
If you know the name Allbirds, it’s probably for the company’s longstanding stated commitment to “sustainable shoes and apparel.” Going forward, though, the corporate entity wants to be known for its “long-term vision to become a fully integrated GPU-as-a-Service (GPUaaS) and AI-native cloud solutions provider.” In a news release Wednesday morning, Allbirds announced that it has secured a $50 million convertible finance facility to help power this unexpected “pivot … to AI compute infrastructure.” If all goes to plan, the company will soon be known as NewBird AI, by which point it will presumably change the image of a spandex-clad hiker that still sits atop its News Release page. Just weeks ago, Allbirds announced the $39 million sale of the “Allbirds brand and footwear assets” to American Exchange Group, owner of Aerosoles, Ecko Unlimited, and other fashion brands. Today’s AI…
10dInfraby Kyle Orland
[AWS]AWS Machine Learning Blog· 4 articlesvisit →
2d ago
Applying multimodal biological foundation models across therapeutics and patient care
Artificial Intelligence Applying multimodal biological foundation models across therapeutics and patient care Healthcare and life sciences decision making increasingly relies on multimodal data to diagnose diseases, prescribe medicine and predict treatment outcomes, develop and optimize innovative therapies accurately. Traditional approaches analyze fragmented data, such as ‘omics for drug discovery, medical images for diagnostics, clinical trial reports for validation, and electronic health records (EHR) for patient treatment. As a result, decision makers (CxOs, VPs, Directors) often miss critical insights hidden in the relationships between data types. Recent advancements in AI enable you to integrate and analyze these fragmented data streams efficiently to support a more complete understanding of therapeutics and patient care. AWS provides a unified environment for multimodal biological foundation models (BioFMs), enabling you to make more confident, timely decision-making in personalized medicine. This AI system combines biological data, model…
2dInfra#multimodalby Kristin Ambrosini
3d ago
Get to your first working agent in minutes: Announcing new features in Amazon Bedrock AgentCore
Artificial Intelligence Get to your first working agent in minutes: Announcing new features in Amazon Bedrock AgentCore Getting an agent running has always meant solving a long list of infrastructure problems before you can test whether the agent itself is any good. You wire up frameworks, storage, authentication, and deployment pipelines, and by the time your agent handles its first real task, you’ve spent days on infrastructure instead of agent logic. We built AgentCore from the ground up to help developers focus on building agent logic instead of backend plumbing, working with frameworks and models they already use, including LangGraph, LlamaIndex, CrewAI, Strands Agents, and more. Today, we’re introducing new capabilities that further streamline the agent building experience, removing the infrastructure barriers that slow teams down at every stage of agent development from the first prototype through production deployment. Go…
3dInfra#agentsby Madhu Parthasarathy
3d ago
Amazon SageMaker AI now supports optimized generative AI inference recommendations
Artificial Intelligence Amazon SageMaker AI now supports optimized generative AI inference recommendations Organizations are racing to deploy generative AI models into production to power intelligent assistants, code generation tools, content engines, and customer-facing applications. But deploying these models to production remains a weeks-long process of navigating GPU configurations, optimization techniques, and manual benchmarking, delaying the value these models are built to deliver. Today, Amazon SageMaker AI supports optimized generative AI inference recommendations. By delivering validated, optimal deployment configurations with performance metrics, Amazon SageMaker AI keeps your model developers focused on building accurate models, not managing infrastructure. We evaluated several benchmarking tools and chose NVIDIA AIPerf, a modular component of NVIDIA Dynamo, because it exposes detailed, consistent metrics and supports diverse workloads out of the box. Its CLI, concurrency controls, and dataset options give us the flexibility to iterate quickly and…
3dInfra#inference#codingby Mona Mona
11d ago
Use-case based deployments on SageMaker JumpStart
Artificial Intelligence Use-case based deployments on SageMaker JumpStart Amazon SageMaker JumpStart provides pretrained models for a wide range of problem types to help you get started with AI workloads. SageMaker JumpStart offers access to solutions for top use cases that can be deployed to SageMaker AI Managed Inference endpoints or SageMaker HyperPod clusters. Through pre-set deployment options, customers can quickly move from model selection to model deployment. Model deployments through SageMaker JumpStart are fast and straightforward. Customers could select options based on expected concurrent users, with visibility into P50 latency, time-to-first token (TTFT), and throughput (token/second/user). While concurrent user configuration options are helpful for general-purpose scenarios, they aren’t task-aware, and we recognize that customers use SageMaker JumpStart for diverse, specific use cases like content generation, content summarization, or Q&A. Each use case might require specific configurations to improve performance. Moreover,…
11dInfraby Dan Ferguson
[CB]Cerebras Blog· 3 articlesvisit →
3d ago
Cerebras and Docker Compose: Building Isolated AI Code Environments September 17, 2025
AI Developers run Cerebras inference inside Docker containers, with Docker Compose, to create safe environments for AI-generated code and inference speed
3d ago
Cerebras API Certification Partner Program for LLM API Providers September 22, 2025
Cerebras inference - the fastest inference API for generative AI.
3dInfra#inference
3d ago
The Fastest AI Datacenters will run on Cerebras: Meet OKC September 22, 2025
Cerebras inference - the fastest inference API for generative AI.
3dInfra#inference
[FAB]Fireworks AI Blog· 7 articlesvisit →
19d ago
Read More
Fireworks Training is now in preview: an end-to-end platform for training and deploying frontier models at scale. Three surfaces for three kinds of teams, from a conversational agent that handles everything, to managed infrastructure for ML engineers, to bring-your-own training loop on Fireworks-hosted clusters. All on the same infrastructure that already handles production inference for Cursor, Vercel, Genspark, and others. All three surfaces are in preview now. Reinforcement learning is how teams push past the ceiling SFT hits on multi-step reasoning, reliable tool use, and mid-flight self-correction. Vercel used our RL infrastructure to build a custom "Auto Fix" model for v0. The model checks the output stream for errors and self-corrects without a second pass, reaching a 93% error-free generation rate, significantly outperforming closed frontier models, with a 40X improvement in end-to-end latency vs. the proprietary model it replaced and…
33d ago
3/23/2026 Frontier RL Is Cheaper Than You Think
On this page The conventional wisdom on RL infrastructure is wrong, and it is costing teams that could be competing at the frontier. The entire mega-cluster narrative rests on a single assumption: that you have to ship 1 TB of weights every time you update your rollout fleet. You do not. Researchers have spent the last year writing about asynchronous RL and rollout-training disaggregation in systems like AReaL. Teams like Kimi and MiniMax have also published engineering notes on RL parameter updates and asynchronous scheduling. We have been running that pattern in production. That mega-cluster instinct comes from pretraining, where the main systems problem is keeping one huge synchronous training job saturated. RL is a different problem. The question is not just how to run the trainer. It is also how to keep a large rollout fleet generating data from…
33dInfra#training
46d ago
3/10/2026 Training-Inference Parity in MoE Models: Where Numerics Drift
On this page Kernel fusions that are mathematically equivalent can still drift numerically. Here are the parity bugs we hit across both Kimi K2.5 serving and Qwen3.5-MoE training bring-up. When you train a model and serve it for inference, you expect them to agree. The same weights, the same input, the same output distribution. This training–inference numerical parity matters more than it sounds: For dense models, parity is relatively easy. Mixture-of-Experts models like Kimi K2.5, Qwen3.5-MoE, and DeepSeek V3 are harder. With routed experts, shared expert pathways, and all-reduce communication twice per layer across deep stacks, there are many places where "mathematically equivalent" optimizations produce numerically different results. This post catalogs the pitfalls we found. Each is a class of optimization that inference engines use for performance, but that can silently break numerical alignment. We found most of these while…
48d ago
3/8/2026 Fireworks Acquires Hathora to Accelerate Global Compute Orchestration
Fireworks AI has acquired Hathora, and we're thrilled to bring their team and technology into the Fireworks family. Lin Qiao shared her excitement about the acquisition, noting, “Hathora’s intense focus on every millisecond and every routing decision is precisely the discipline required for cutting-edge AI inference.” Since the first multiplayer games appeared on the internet, lag has been the enemy. In gaming, milliseconds determine whether you win or lose. Speed isn’t a feature; it’s survival. AI inferences is entering that same era. Solving that requires a particular kind of team: engineers who obsess over systems, performance, and reliability at a global scale. From the beginning, Fireworks has set out to build an elite group of infrastructure engineers. People who care deeply about kernel performance, scheduling decisions, networking paths, and the invisible layers that make intelligent systems instantaneous. The Hathora team…
48dInfra#inference
48d ago
3/8/2026 Introducing Fireworks on Microsoft Foundry: Bringing Best-in-Class Open Model inference to Azure
We are excited to announce the Public Preview of Fireworks AI on Microsoft Foundry, bringing our best-in-class fast open-model serving directly into Azure. This partnership integrates Fireworks’ high-performance inference and State-of-the-Art (SOTA) open models into the unified Microsoft Foundry platform, which already offers a wide selection of models. By empowering developers with the fastest path to production-grade open-models, this milestone ensures teams using this new solution have one place to use any model, any framework, with enterprise‑grade controls to build and run AI applications and agents at scale. Across industries, organizations are increasingly standardizing on open models to get greater control over performance, cost, customization, and the security and compliance needed for enterprise deployment. With open models, teams can choose the right architecture per workload, bring their own weights, and fine-tune for quality, latency, and cost without provider lock‑in. Yet…
48dInfra#inference
85d ago
1/30/2026 The Missing Piece of the OpenClaw Mania: Truly ‘Own Your AI’ with Fireworks AI
Building a "Personal Operating System" means nothing if you don't control the brain. Move your OpenClaw agent onto secure, cost-efficient, and fully private infrastructure. The recent explosion of interest around OpenClaw (formerly Moltbot or Clawdbot) has been incredible to watch. We are finally moving past simple chatbots and into a true agentic future—where an AI can handle your emails, manage your calendar, and act as a genuine extension of yourself. It's the dawn of the personal AI operating system. But there is a massive contradiction at the heart of the current OpenClaw phenomenon. Many are building a highly intimate "personal OS" that has access to your most private data—your messages, your files, your digital life—yet most users are piping that data straight into "black box" APIs from closed-source model providers. You get convenience, but you lose control. You don't know…
89d ago
1/26/2026 Kimi K2.5 is Live on Fireworks: Vibe Coding, Agents, and Full-Parameter RFT
Kimi K2.5 is Moonshot AI’s flagship agentic model and a new SOTA open model. It unifies vision and text, thinking and non-thinking modes, and multi-agent execution into one model. We are launching Day-0 support for Kimi K2.5. Fireworks offers the fastest endpoint for all Kimi K2 series models as well as fine tuning for Kimi K2 models. Additionally, we now offer a full parameter RL tuning private preview for Kimi K2.5, enabling application builders to fine tune the SOTA OSS VLM model for use cases like vibe coding and agentic workflows. Sign up for the full parameter RL tuning waitlist here. Kimi K2.5 demonstrates that open source models are now surpassing their closed-source counterparts. The chart provides more details on the multiple benchmarks where Kimi K2.5 achieves SOTA results, including for Agents (HLE Full, BrowseComp, and Deepsearch) and for Vision…
[GDM]Google DeepMind Blog· 2 articlesvisit →
39d ago
Broadening advanced AI education across Africa
Broadening advanced AI education across Africa AI is driving scientific discoveries and research breakthroughs, but its progress depends on a global community. To bridge the gap between talent and opportunity, Google DeepMind is launching additional courses of its AI Research Foundations curriculum: advanced AI education designed for the next generation of technical learners across Africa. Hands-on experience with generative AI models The courses, developed with pedagogy experts and academics at University College London — and available at no cost on Google Skills — give learners the opportunity to build and fine-tune a language model from the ground up. Google.org is supporting the curriculum’s rollout in African classrooms by providing funding for lecturer training and instructional toolkits. The curriculum, already serving thousands of users globally, moves beyond AI literacy, providing technical university students and community learners with a deep, applied understanding…
39dInfraby Leslie Yeh
88d ago
In our latest podcast, hear how the “Smoke Jumpers” team brings Gemini to billions of people.
Bringing Gemini to billions of users requires a massive, coordinated infrastructure effort. In the latest episode of the Google AI: Release Notes podcast, host Logan Kilpatrick sits down with Emanuel Taropa to discuss the "Smokejumpers,” a nimble, cross-functional team of engineers and product experts that handle Google's most complex and critical AI launches. In the episode, they explore the technical connective tissue that makes Gemini 3 possible, the advantages of Google’s TPU strategy, and the high-intensity culture that builds and ships world-class AI models at scale. Hear the full conversation below, or listen to the Google AI: Release Notes podcast on Apple Podcasts or Spotify.
88dInfra#gemini
[GB]Groq Blog· 2 articlesvisit →
16d ago
Canopy Labs’ Orpheus TTS is live on GroqCloud
Canopy Labs’ Orpheus TTS is live on GroqCloud In December, we announced support for Canopy Labs’ Orpheus text-to-speech (TTS) on GroqCloud, with two model variants built for real-time, high-quality voices: - English TTS: canopylabs/orpheus-v1-english (with vocal directions) - Saudi Arabic (dialect) TTS: canopylabs/orpheus-arabic-saudi (authentic pronunciation + regional nuance) Today, we’re excited to announce a new release of the Saudi Arabic Orpheus TTS model on GroqCloud (canopylabs/orpheus-arabic-saudi). This release brings overall model improvements, including reduced hallucinations, more natural and expressive speech, and more accurate handling of numbers and symbols. It also introduces two new Saudi Arabic voices designed to sound more natural, culturally grounded, and production-ready. - Abdullah — A professional, calm, and conversational male voice, ideal for assistants, enterprise workflows, and general voice interfaces. - Aisha — A professional, clear, and approachable female voice, especially effective for customer support and…
16dInfra#inference
68d ago
GroqCloud: Expanding to Meet Demand
GroqCloud: Expanding to Meet Demand Demand for high-performance AI inference is accelerating globally, driven by real-time applications moving from experimentation into production. As this shift takes hold, infrastructure that delivers predictable performance, low latency, and efficient scale is becoming increasingly critical. At Groq, our architecture, roadmap, and customer commitments remain Groq-led. At the same time, GroqCloud adoption continues to support our planned global infrastructure expansion, enabling reliable inference deployments for developers and enterprises wherever they operate. Scaling GroqCloud for Production Workloads As interest in inference-optimized infrastructure continues to rise, GroqCloud has seen record levels of developers—now exceeding 3.5 million—along with sustained increases in production traffic. Teams across industries are using GroqCloud to power real-time applications where consistency, determinism, and cost efficiency are non-negotiable. To support this momentum, Groq is continuing to scale GroqCloud’s global availability. New UK Data Center Expands…
[H(B]Haystack (deepset) Blog· 1 articlesvisit →
46d ago
Multimodality Embeddings Bilge Yücel DevRel Engineer Stefano Fiorucci AI/Software Engineer Multimodal Search with Gemini Embedding 2 in Haystack Build multimodal search systems in Haystack using Gemini Embedding 2 to embed text, images, video, audio, and PDFs in a shared vector space. March 10, 2026
Multimodal Search with Gemini Embedding 2 in Haystack Build multimodal search systems in Haystack using Gemini Embedding 2 to embed text, images, video, audio, and PDFs in a shared vector space. March 10, 2026Embeddings are the backbone of modern AI applications, from semantic search and recommendation systems to Retrieval-Augmented Generation (RAG). However, most embedding models operate in a single modality, typically focusing only on textual data. Google has introduced Gemini Embedding 2, a fully multimodal embedding model that maps text, images, video, audio, and PDFs into a shared vector space. This means you can search across different types of data using a single embedding model: gemini-embedding-2-preview . Even better, Haystack supports Gemini Embedding 2 from Day 0. Through the Google GenAI x Haystack integration, you can immediately start using the model in your Haystack applications for both text and multimodal…
[HF]Hugging Face Blog· 11 articlesvisit →
4d ago
AI and the Future of Cybersecurity: Why Openness Matters
AI and the Future of Cybersecurity: Why Openness Matters What is Mythos? Mythos is a “frontier AI model”, a large language model (LLM) that can be used to process software code (among many other things). This follows a general trend in LLM development, where LLM performance on code-related tasks has recently skyrocketed. What’s particularly significant about Mythos is the system it’s embedded within: It's the system, not the model alone, that has enabled Mythos to rapidly find and patch software vulnerabilities. Understanding this distinction is key to understanding the current landscape of AI cybersecurity. What Mythos demonstrates is that the following system recipe is powerful: - substantial compute power - models trained on troves of software-relevant data - scaffolding built to handle software vulnerability probing and patching - speed (enabled by compute power and the capital behind it) - some…
4dInfra#coding
9d ago
Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers As a practical example, I'll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR), the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting tomaarsen/Qwen3-VL-Embedding-2B-vdr demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model's 0.888, and outperforms all existing VDR models I tested against, including models up to 4x its size. If you're new to multimodal models in Sentence Transformers, I recommend reading Multimodal Embedding & Reranker Models with Sentence Transformers first. For training text-only embedding, reranker, or sparse embedding models, see the Prior Blogposts section at the end. Table of Contents - Why Finetune? -…
9d ago
Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents
Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents TL;DR — We extend the RLVE framework from single-turn reasoning puzzles to multi-turn, tool-augmented e-commerce conversations. EcomRLVE-GYM provides 8 verifiable environments — product discovery, substitution, cart building, returns, order tracking, policy QA, bundle planning, and multi-intent journeys — each with procedural problem generation, a 12-axis difficulty curriculum, and algorithmically verifiable rewards. We train a Qwen 3 8B model with DAPO over 300 steps and present early results demonstrating that environment scaling and adaptive difficulty transfer to agentic, real-world task completion. This project originated in the Pytorch OpenEnv Hackathon and is still evolving, follow us for updates 🔥 Why RL for shopping agents? Large language models can hold fluent conversations, yet deploying them as shopping assistants reveals a persistent gap: fluency ≠ task completion. A customer who asks "find me a USB-C charger…
10d ago
Meet HoloTab by HCompany. Your AI browser companion.
Meet HoloTab by HCompany. Your AI browser companion. We built one of the most powerful computer-use AIs in the world. And made it directly accessible from your browser. On March 31st, we released Holo3, our most advanced computer-use model to date. Building something powerful is one thing; making it accessible and easy to use is another. We’re doing both. HoloTab is a Chrome extension that navigates the web just like a person would. It automates tasks across any website with zero setup or technical skills required. You describe what you want, and the agent handles it directly inside your browser, navigating interfaces, filling fields, and making decisions the same way you would. The vision models, the action planning, the interface understanding, all of it is running underneath, working for you, and all you ever see is the result. Routines: Show…
16d ago
Multimodal Embedding & Reranker Models with Sentence Transformers
Multimodal Embedding & Reranker Models with Sentence Transformers Multimodal embedding models map inputs from different modalities into a shared embedding space, while multimodal reranker models score the relevance of mixed-modality pairs. This opens up use cases like visual document retrieval, cross-modal search, and multimodal RAG pipelines. If you want to train your own multimodal models, check out the companion blogpost: Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers. Table of Contents - What are Multimodal Models? - Installation - Multimodal Embedding Models - Multimodal Reranker Models - Retrieve and Rerank - Input Formats and Configuration - Supported Models - Additional Resources What are Multimodal Models? Traditional embedding models convert text into fixed-size vectors. Multimodal embedding models extend this by mapping inputs from different modalities (text, images, audio, or video) into a shared embedding space. This means you…
23d ago
Welcome Gemma 4: Frontier multimodal intelligence on device
Welcome Gemma 4: Frontier multimodal intelligence on device These models are the real deal: truly open with Apache 2 licenses, high quality with pareto frontier arena scores, multimodal including audio, and sizes you can use everywhere including on-device. Gemma 4 builds on advances from previous families and makes them click together. In our tests with pre-release checkpoints we have been impressed by their capabilities, to the extent that we struggled to find good fine-tuning examples because they are so good out of the box. We collaborated with Google and the community to make them available everywhere: transformers, llama.cpp, MLX, WebGPU, Rust; you name it. This blog post will show you how to build with your favorite tools so let us know what you think! Table of Contents - What is New with Gemma 4? - Overview of Capabilities and Architecture…
25d ago
Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents
Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents - Table Extraction: Accurately parsing complex table structures (e.g., multi-row, multi-column, etc.) from document images - Chart Understanding: Converting charts and figures into structured machine-readable formats, summaries, or executable code - Semantic Key-Value Pair (KVP) Extraction: Identifying and grounding semantically meaningful key-value field pairs across diverse document layouts The model ships as a LoRA adapter on top of Granite 4.0 Micro, our dense language model, keeping vision and language modular for text-only fallbacks and seamless integration into mixed pipelines. It continues to support vision-language tasks such as producing detailed natural-language descriptions from images (e.g., “Describe this image in detail”). The model can be used standalone or in tandem with Docling to enhance document processing pipelines with deep visual understanding capabilities. How Granite 4.0 3B Vision Was Built Granite 4.0 3B…
25dInfra#multimodal
39d ago
Holotron-12B - High Throughput Computer Use Agent
Holotron-12B - High Throughput Computer Use Agent We're thrilled to release Holotron-12B, a multimodal computer-use model from H Company. Post-trained from the open NVIDIA Nemotron-Nano-2 VL model on H Company’s proprietary data mixture, Holotron-12B is the result of a close collaboration between our research labs to engineer a new type of model optimized primarily for scale and performance in production. H Company is part of the NVIDIA Inception Program. The model is now available on Hugging Face. Why We Built Holotron-12B Most multimodal models today optimize primarily for static vision or following instructions. Holotron-12B, just like our Holo2 model, however, has a different goal: serving as a policy model for computer-use agents that must perceive, decide, and act efficiently in interactive environments. With Holotron-12B, we wanted to create a model that could efficiently and effectively scale in production while handling…
47d ago
LeRobot v0.5.0: Scaling Every Dimension
LeRobot v0.5.0: Scaling Every Dimension TL;DR LeRobot v0.5.0 adds full Unitree G1 humanoid support (whole-body control models), new policies –including Pi0-FAST autoregressive VLAs and Real-Time Chunking for responsive inference–, and streaming video encoding that eliminates wait times between recording episodes. The release also introduces EnvHub for loading simulation environments from the Hugging Face Hub, NVIDIA IsaacLab-Arena integration, and a major codebase modernization with Python 3.12+, Transformers v5, and third-party policy plugins. Table of Contents - LeRobot v0.5.0: Scaling Every Dimension Hardware: More Robots Than Ever LeRobot v0.5.0 dramatically expands the roster of supported hardware — from arms and mobile robots to a full humanoid. Unitree G1 Humanoid The biggest hardware addition in this release: full Unitree G1 humanoid support. This is LeRobot's first humanoid integration, and it's comprehensive: - Locomotion: Walk, navigate, and move through environments. - Manipulation: Perform dexterous…
47dInfra
51d ago
Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations
Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations Authors: Enzo Ruedas, Tess Boivin Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems. First, with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms remains a challenge due to tight constraints in terms of compute, memory, and power, as well as real-time control requirements. In synchronous control pipelines, while the VLA is running inference, the arm is idle awaiting commands leading to oscillatory behavior and delayed corrections. To tackle that, asynchronous Inference can enable smooth and continuous motion by dissociating generation from execution. However, to be effective, the end-to-end inference latency must remain shorter than the action execution duration.…
64d ago
Train AI models with Unsloth and Hugging Face Jobs for FREE
Train AI models with Unsloth and Hugging Face Jobs for FREE LiquidAI/LFM2.5-1.2B-Instruct ) through coding agents like Claude Code and Codex. Unsloth provides ~2x faster training and ~60% less VRAM usage compared to standard methods, so training small models can cost just a few dollars. Why a small model? Small language models like LFM2.5-1.2B-Instruct are ideal candidates for fine-tuning. They are cheap to train, fast to iterate on, and increasingly competitive with much larger models on focused tasks. LFM2.5-1.2B-Instruct runs under 1GB of memory and is optimized for on-device deployment, so what you fine-tune can be served on CPUs, phones, and laptops. You will need We are giving away free credits to fine-tune models on Hugging Face Jobs. Join the Unsloth Jobs Explorers organization to claim your free credits and one-month Pro subscription. - A Hugging Face account (required for…
[MRB]Microsoft Research Blog· 2 articlesvisit →
3d ago
AutoAdapt: Automated domain adaptation for large language models
At a glance - Problem: Adapting large language models to specialized, high-stakes domains is slow, expensive, and hard to reproduce. - What we built: AutoAdapt automates planning, strategy selection (e.g., RAG vs. fine-tuning), and tuning under real deployment constraints. - How it works: A structured configuration graph maps the full scope of the adaptation process, an agentic planner selects and sequences the right steps, and a budget-aware optimization loop (AutoRefine) refines the process within defined constraints. - Why it matters: The result is faster, automated, more reliable domain adaptation that turns weeks of manual iteration into repeatable pipelines. Deploying large language models (LLMs) in real-world, high-stakes settings is harder than it should be. In high-stakes settings like law, medicine, and cloud incident response, performance and reliability can quickly break down because adapting models to domain-specific requirements is a slow and…
3dInfra#rag#agents#fine-tuningby Sidharth Sinha, Anson Bastos, Xuchao Zhang, Akshay Nambi, Rujia Wang, Chetan Bansal
52d ago
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
At a glance - Phi-4-reasoning-vision-15B is a compact and smart open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It is a broadly capable model that allows for natural interaction for a wide array of vision-language tasks and excels at math and science reasoning and understanding user-interfaces. - We share lessons learned and best practices for training a multimodal reasoning model—showing the benefit of careful architecture choices, rigorous data curation, and the benefits of using a mixture of reasoning and non-reasoning data. We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning,…
52dInfra#phi#multimodal#trainingby Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas
[MTR]MIT Technology Review· 1 articlesvisit →
5d ago
Chinese tech workers are starting to train their AI doubles—and pushing back
Chinese tech workers are starting to train their AI doubles—and pushing back A viral GitHub project that claims to clone coworkers into a reusable AI skill is forcing Chinese tech workers to confront deeper fears. Tech workers in China are being instructed by their bosses to train AI agents to replace them—and it’s prompting a wave of soul-searching among otherwise enthusiastic early adopters. Earlier this month a GitHub project called Colleague Skill, which claimed workers could use it to “distill” their colleagues’ skills and personality traits and replicate them with an AI agent, went viral on Chinese social media. Though the project was created as a spoof, it struck a nerve among tech workers, a number of whom told MIT Technology Review that their bosses are encouraging them to document their workflows in order to automate specific tasks and processes…
5dInfraby Caiwei Chen
[NV]NVIDIA Developer Blog· 25 articlesvisit →
3d ago
Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron
Higher-order optimization algorithms such as Shampoo have been effectively applied in neural network training for at least a decade. These methods have achieved significant success more recently when applied to leading LLMs. In particular, Muon (MomentUm Orthogonalized by Newton-Schulz) was used to train some of today’s best open source models, including Kimi K2 and GLM-5. This post explains how NVIDIA provides comprehensive support for Muon and other cutting-edge emerging optimizers and the technologies enabling them to train large-scale models. Muon training performance on NVIDIA GB300 NVL72 Table 1 summarizes training throughput of the Kimi K2 and Qwen3 30B models with Muon and the AdamW optimizer on the NVIDIA GB300 NVL72 system. With the technologies that will be introduced in the next section, the results show that there is a very small training performance loss using the Muon optimizer compared to…
5d ago
Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy Optimization (GRPO) power this transition, enabling reasoning-grade models to continuously improve through iterative feedback. Unlike standard supervised fine-tuning, RL training loops are bifurcated into two distinct, high-intensity phases: a generation phase with a stringent latency requirement and a training phase requiring high throughput. To make these workloads viable, researchers and engineers are turning to low-precision datatypes like FP8 to boost performance in training and throughput-oriented generation. Moreover, in some scenarios where generation is bound by GPU memory bandwidth, using low-precision parameters can improve performance due to fewer bytes per parameter. This post dives deep into the systemic challenges of low-precision RL and how NVIDIA NeMo RL—an open source library within the NVIDIA NeMo framework—speeds up RL workloads while…
5dInfra#inference#trainingby Guyue Huang
23d ago
Achieving Single-Digit Microsecond Latency Inference for Capital Markets
In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use specialized hardware like FPGAs and ASICs. Yet, as markets grow more efficient, traders increasingly depend on advanced models such as deep neural networks to enhance profitability. Because implementing these complex models on low-level hardware requires significant investment, general-purpose GPUs offer a practical, cost-effective alternative. The NVIDIA GH200 Grace Hopper Superchip in the Supermicro ARS-111GL-NHR server has achieved single-digit microsecond latencies in the STAC-ML Markets (Inference) benchmark, Tacana suite (audited by STAC), providing performance comparable to or better than specialized hardware systems. This post details these record-breaking results and provides a deep dive into the custom-tailored solutions required for low-latency GPU inference. It also walks you through an open source reference implementation and a tutorial for getting started. STAC-ML…
23dInfra#inferenceby Nikolay Markovskiy
23d ago
Bringing AI Closer to the Edge and On-Device with Gemma 4
The Gemmaverse expands with the launch of the latest Gemma 4 multimodal and multilingual models, designed to scale across the full spectrum of deployments, from NVIDIA Blackwell in the data center to Jetson at the edge. These models are suited to meet the growing demand for local deployment for AI development and prototyping, secure on-prem requirements, cost efficiency, and latency-sensitive use cases. The newest generation improves both efficiency and accuracy, making these general-purpose models well-suitable for a wide range of common tasks: - Reasoning: Strong performance on complex problem-solving tasks. - Coding: Code generation and debugging for developer workflows. - Agents: Native support for structured tool use (function calling). - Vision, video and audio capability: Enables rich multimodal interactions for use cases such as object recognition, automated speech recognition (ASR), document and video intelligence, and more. - Interleaved multimodal input:…
23dInfra#multimodal#localby Anu Srivastava
31d ago
Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per Watt
In the AI era, power is the ultimate constraint, and every AI factory operates within a hard limit. This makes performance per watt—the rate at which power is converted into revenue-generating intelligence—the defining metric for modern AI infrastructure. AI data centers now operate as token factories tied directly to the energy ecosystem, where access to land, power, and shell determines deployment, and efficiency determines output. Increasing revenue within a fixed power envelope depends entirely on maximizing intelligence per watt across AI infrastructure and across the five-layer AI cake ecosystem. This post walks through how NVIDIA architectures, systems, and AI factory software maximize performance per watt at every layer of the stack, and how those efficiency gains translate into higher token throughput and revenue per megawatt. Compounding performance per watt across NVIDIA GPU architectures NVIDIA architectures and platforms are engineered to…
31dInfraby Kibibi Moseley
32d ago
Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety
Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale, developers need models that can understand real-world multimodal data, converse naturally with users globally, and operate safely across languages and modalities. At GTC 2026, NVIDIA introduced a new generation of NVIDIA Nemotron models designed to work together as a unified agentic stack: - NVIDIA Nemotron 3 Super for long-context reasoning and agentic tasks - NVIDIA Nemotron 3 Ultra (coming soon) for highest reasoning accuracy and efficiency among open frontier models - NVIDIA Nemotron 3 Content Safety for multimodal, multilingual content moderation - NVIDIA Nemotron 3 VoiceChat (in early access) for low latency, natural, full-duplex voice interactions - NVIDIA Nemotron 3 Nano Omni (coming soon) for enterprise-grade multimodal understanding - NVIDIA Nemotron RAG for generating embeddings for image and…
32dInfra#rag#agents#multimodal#gpuby Chintan Patel
33d ago
Deploying Disaggregated LLM Inference Workloads on Kubernetes
As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages have fundamentally different compute profiles, yet traditional deployments force them onto the same hardware, leaving GPUs underutilized and scaling inflexible. Disaggregated serving addresses this by splitting the inference pipeline into distinct stages such as prefill, decode, and routing, each running as an independent service that can be resourced and scaled on its own terms. This post will give an overview of how disaggregated inference gets deployed on Kubernetes, explore different ecosystem solutions and how they execute on a cluster, and evaluate what they provide out of the box. How do aggregated and disaggregated inference differ? Before diving into Kubernetes manifests, it helps to understand the two inference deployment modes for LLMs: In aggregated serving, a single…
33dInfra#inference#codingby Anish Maddipoti
39d ago
Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere
AI-native services are exposing a new bottleneck in AI infrastructure: As millions of users, agents, and devices demand access to intelligence, the challenge is shifting from peak training throughput to delivering deterministic inference at scale—predictable latency, jitter, and sustainable token economics. NVIDIA announced at GTC 2026 that telcos and distributed cloud providers are transforming their networks into AI grids, embedding accelerated computing across a mesh of regional POPs, central offices, metro hubs, and edge locations to meet the needs of AI-native services. This post explains how AI grids make real-time, multi-modal, and hyper-personalized AI experiences viable at scale by running inference across distributed, workload-, resource- and KPI-aware AI infrastructure. Intelligent workload placement across distributed sites The NVIDIA AI Grid reference design provides a unified framework for building geographically distributed, interconnected, and orchestrated AI infrastructure. Figure 1 shows how existing network…
39dInfra#gpuby Sree Sankar
40d ago
NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer
Artificial intelligence is token-driven. Every prompt, reasoning step, and agent interaction generates tokens. Over the past year, token consumption has grown multifold and now exceeds 10 quadrillion tokens per year. And while the majority of tokens have been generated from humans interacting with AI, the new era is one in which most tokens will be generated from AI interacting with AI. Modern agentic systems plan tasks, invoke tools, execute code, retrieve data, and coordinate across continuous multistep workflows with numerous AI agents. These interactions generate large volumes of reasoning tokens, expand KV cache, and require CPU-based sandboxed environments to test and validate results generated by accelerated computing systems. This places low latency, high throughput demands across GPUs, CPUs, scale-up domains, scale-out networks, and storage. Delivering useful intelligence for these modern agentic systems requires fleets of purpose-built rack-scale systems that function…
40dInfra#agents#gpuby Rohil Bhargava
40d ago
NVIDIA Vera CPU Delivers High Performance, Bandwidth, and Efficiency for AI Factories
AI is evolving, and reasoning models are increasing token demand, placing new requirements on every layer of AI infrastructure. More than ever, compute must scale efficiently to maximize token production and improve productivity for model creators and users. Modern GPUs operate at peak capacity, pushing throughput higher every generation, but system performance is increasingly gated by the CPU-bound serial tasks within an agentic loop–a classic example of a core computer science principle, called Amdahl’s law. This dynamic is especially visible in two classes of workloads: reinforcement learning (RL) for training models with new specialized skills such as coding or engineering, and agentic actions, which enable AI agents to use tools like web browsers, databases, code interpreters, and other software to complete tasks in real environments, or sandboxes. Both workloads combine two historically separate CPU characteristics. Individual environments require strong single-threaded…
40dInfra#gpuby Praveen Menon
40d ago
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Building AI factories is complex and requires efficient integration across compute, networking, security, and storage systems. To achieve rapid Time to AI and strong ROI, the new NVIDIA DSX Air is enabling organizations to simulate their entire AI factory infrastructure in the cloud—covering compute, networking, storage, and security. Being able to design, test, and optimize systems before deploying hardware enables every layer of the AI factory to function as a unified, optimized system, preventing major delays or performance issues related to integration or misconfiguration challenges. DSX Air also enables continuous testing and validation of provisioning, automation, and security policies to streamline ongoing operations. This post shows how users can benefit from NVIDIA DSX Air through accelerated deployment timelines and simplified, full-stack cluster management. How DSX Air enables AI factory simulation To make AI factory simulation useful and practical for end…
40dInfra#rag#gpuby Ranga Maddipudi
40d ago
Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark
Autonomous AI agents are driving the next wave of AI innovation. These agents must often manage long-running tasks that use multiple communication channels and background subprocesses simultaneously to explore options, test solutions, and generate optimal results. This places extreme demands on local compute. NVIDIA DGX Spark provides the performance necessary for autonomous agents to execute these complex workflows efficiently and locally. Now with NVIDIA NemoClaw, part of the NVIDIA Agent Toolkit, it installs the NVIDIA OpenShell runtime—a secure environment for running autonomous agents, and open source models like NVIDIA Nemotron. This post discusses several important aspects of system capabilities and performance that are necessary to power always-on autonomous agents and explains why NVIDIA DGX Spark is an ideal desktop platform for autonomous AI. Inference for autonomous AI agents Agentic tools often need to process massive context windows. OpenClaw, for example,…
40dInfra#agents#gpuby Allen Bourgoyne
40d ago
Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI
AI‑native organizations increasingly face scaling challenges as agentic AI workflows drive context windows to millions of tokens and models scale toward trillions of parameters. These systems rely on agentic long‑term memory for context that persists across turns, tools, and sessions so agents can build on prior reasoning instead of starting from scratch on every request. As context windows increase, Key-Value (KV) cache capacity requirements grow proportionally, while the compute requirements to recalculate that history grow much faster, making KV cache reuse and efficient storage essential for performance and efficiency. This increases pressure on existing memory hierarchies, forcing AI providers to choose between scarce GPU high‑bandwidth memory (HBM) and general‑purpose storage tiers optimized for durability, data management, and protection—not for serving ephemeral, AI-native, KV cache—driving up power consumption, inflating cost per token, and leaving expensive GPUs underutilized. The NVIDIA Vera Rubin…
40dInfra#rag#agents#gpuby Moshe Anschel
40d ago
Using Simulation to Build Robotic Systems for Hospital Automation
Healthcare faces a structural demand–capacity crisis: a projected global shortfall of ~10 million clinicians by 2030, billions of diagnostic exams annually with significant unmet demand, hundreds of millions of procedures with large access gaps, and costly operating room (OR) inefficiencies measured in tens of dollars per minute. The future hospital must therefore be automation-enabled—where robotics extends clinician capacity, increases procedural throughput, reduces variability, and democratizes access to high-quality care. Imagine autonomous imaging robots navigating patient anatomy to provide X-rays for the unserved billions, while in the OR, ‘Surgical Subtask Automation’ handles repetitive suturing so surgeons can focus on critical decisions. Beyond the bedside, service robots recapture wasted minutes by autonomously delivering supplies, saving nurses miles of walking. The data gap and real-world limits The core bottleneck is data. Hospitals are heterogeneous, chaotic, and high-stakes environments—every facility has different layouts, workflows,…
40dInfra#agents#inferenceby Mingxin Zheng
44d ago
Build Accelerated, Differentiable Computational Physics Code for AI with NVIDIA Warp
Computer-aided engineering (CAE) is shifting from human-driven workflows toward AI-driven ones, including physics foundation models that generalize across geometries and operating conditions. Unlike LLMs, these models depend on large volumes of high-fidelity, physics-compliant data. Recent scaling-law work on computational fluid dynamics (CFD) surrogates indicates that simulation-generated training data is often the limiting cost in practice. This pushes requirements onto the simulator, which must be GPU-native, fast, and able to plug directly into ML workflows. NVIDIA Warp is a framework for accelerated simulation, data generation, and spatial computing that bridges CUDA and Python. Warp enables developers to write high-performance kernels as regular Python functions that are JIT-compiled into efficient code for execution on the GPU. Unlike the tensor-based frameworks, in which developers express computation as operations on entire N-dimensional arrays, developers author flexible kernels in the Warp framework that execute simultaneously…
44dInfra#agents#coding#gpuby Sheel Nidhan
46d ago
Reliable AI Coding for Unreal Engine: Improving Accuracy and Reducing Token Costs
Agentic code assistants are moving into daily game development as studios build larger worlds, ship more DLCs, and support distributed teams. These assistants can accelerate development by helping with tasks like generating gameplay scaffolding, refactoring repetitive systems, and answering engine-specific questions faster. This post outlines how developers can build reliable AI coding workflows for Unreal Engine (UE) 5, from individual setups to team and enterprise-scale systems. Reliability is critical because real-world Unreal codebases are defined by engine conventions, large C++ projects, custom tools, branch differences, and studio-specific coding patterns that generic AI often fails to understand. The core challenge is the context gap. Failures rarely come from weak code generation, but from missing constraints such as code patterns, branch differences, or internal conventions. Improving context retrieval reduces guesswork and makes AI output reliable enough for production use. NVIDIA works with…
46dInfra#agents#codingby Paul Logan
47d ago
Removing the Guesswork from Disaggregated Serving
Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal configuration for any given workload (such as hardware, parallelism, and prefill/decode split) resides in a massive, multi-dimensional search space that is impossible to explore manually or through exhaustive testing. AIConfigurator, an open source tool that simplifies the NVIDIA Dynamo AI serving stack, is intended to cut through this complexity and get you to an optimal deployment in minutes. The core benefit of AIConfigurator is that you don’t need to run every possible configuration on real hardware to predict which one will perform best. Instead, it decomposes LLM inference into its constituent operations and measures each one in isolation on the target GPU. AIConfigurator can then reassemble those measurements to estimate the end-to-end performance of any configuration, all without occupying a single…
47dInfra#inferenceby Tianhao Xu
61d ago
Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy
As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as training throughput expectations, memory limits, and rising costs are becoming the primary barriers to scaling transformer models. Using lower-precision training can address these challenges. By reducing the numeric precision used during computation, GPUs can process more operations per cycle, enhancing training efficiency and lowering costs. This post compares the following three low-precision training formats directly against established BF16 precision training across multi-hundred-billion token pretraining runs and downstream benchmarks: - 8-bit floating point per-tensor current scaling (FP8-CS) - Mixed precision training with FP8 (MXFP8) - NVFP4 precision training using NVIDIA NeMo Megatron Bridge, an open source library that is part of NVIDIA NeMo framework We present practical, large-scale results showing how low-precision training delivers up to…
61dInfra#inference#trainingby Aditya Vavre
67d ago
Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities
Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms, and embedded metadata. Financial reports carry critical insights in tables, engineering manuals rely on diagrams, and legal documents often include annotated or scanned content. Retrieval-augmented generation (RAG) was created to ground LLMs in trusted enterprise knowledge—retrieving relevant source data at query time to reduce hallucinations and improve accuracy. But if a RAG system processes only surrounding text, it misses key signals embedded in tables, charts, and diagrams—resulting in incomplete or incorrect answers. An intelligent agent is only as good as the data foundation it’s built on. Modern RAG must therefore be inherently multimodal—able to understand both visual and textual context to achieve enterprise-grade accuracy. The NVIDIA Enterprise RAG Blueprint is built for this, providing a modular reference architecture that connects…
67dInfra#rag#multimodalby Shruthii Sathyanarayanan
74d ago
R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab
Building robust, intelligent robots requires testing them in complex environments. However, gathering data in the physical world is expensive, slow, and often dangerous. It is nearly impossible to safely train for real-world critical risks, such as high-speed collisions or hardware failures. Worse, real-world data is usually biased toward “normal” conditions, leaving robots unprepared for the unexpected. Simulation is essential to bridge this gap, providing a risk-free environment for rigorous development. However, traditional pipelines struggle to support the complex needs of modern robotics. Today’s generalist robots must master multimodal learning—fusing diverse inputs such as vision, touch, and proprioception to navigate messy, unstructured worlds. This creates a new requirement for simulation: it must deliver scale, realism, and multimodal sensing all in one tight training loop, something traditional CPU-bound simulators cannot handle efficiently. This edition of NVIDIA Robotics Research and Development Digest (R²D²)…
74dInfra#multimodal#gpuby Oyindamola Omotuyi
75d ago
Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy
NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires significant manual effort. To address this challenge, today we are announcing the availability of AutoDeploy as a beta feature in TensorRT LLM. AutoDeploy compiles off-the-shelf PyTorch models into inference-optimized graphs. This avoids the need to bake inference-specific optimizations directly into model code, reducing LLM deployment time. AutoDeploy enables the shift from manually reimplementing and optimizing each model toward a compiler-driven workflow that separates model authoring from inference optimization. This post introduces AutoDeploy architecture and capabilities and shows how it enabled support for recent NVIDIA Nemotron models at launch. What is AutoDeploy? Every new LLM architecture comes with its own inference challenges, from transformer models to hybrid vision language models (VLMs) to state space models (SSMs). Turning a reference…
75dInfra#agents#inference#multimodal#codingby ​​Lucas Liebenwein
78d ago
3 Ways NVFP4 Accelerates AI Training and Inference
The latest AI models continue to grow in size and complexity, demanding increasing amounts of compute performance for training and inference—far beyond what Moore’s Law can keep up with. That’s why NVIDIA engages in extreme codesign. Designing across multiple chips and a mountain of software cohesively enables large generational leaps in AI factory performance and efficiency. Lower-precision AI formats are key to improving compute performance and energy efficiency. Bringing the benefits of ultra-low-precision numerics to AI training and inference while maintaining high accuracy requires extensive engineering across every layer of the technology stack. It spans the creation of the formats, implementation in silicon, enablement across many libraries, and working closely with the ecosystem to deploy new training recipes and inference optimization techniques. NVFP4, developed and implemented for NVIDIA GPUs starting with NVIDIA Blackwell, delivers the performance and energy-efficiency benefits of…
78dInfra#inference#trainingby Ashraf Eassa
79d ago
How Painkiller RTX Uses Generative AI to Modernize Game Assets at Scale
Painkiller RTX sets a new standard for how small teams can balance massive visual ambition with limited resources by integrating generative AI. By upscaling thousands of legacy textures into high-quality Physically Based Rendering (PBR) materials—a process that would have traditionally taken years—the team dramatically reduced the burden of repetitive work. This approach was especially impactful for contributors without traditional modding backgrounds, freeing them to focus on creative decisions: refining materials and ensuring the game’s iconic atmosphere responds correctly to ray-traced lighting. Learn how the team architected a production pipeline that blends automation with artistic judgment across 35 unique levels. To explore the motivations, solutions, and lessons behind these technical challenges, we spoke with McGillacutty (environment reconstruction and material lead), Quinn Baddams (team lead and founder of Merry Pencil Studios), and NightRaven (creator of PBRFusion). What’s your professional background and current…
79dInfraby Phillip Singh
87d ago
Updating Classifier Evasion for Vision Language Models
Advances in AI architectures have unlocked multimodal functionality, enabling transformer models to process multiple forms of data in the same context. For instance, vision language models (VLMs) can generate output from combined image and text input, enabling developers to build systems that interpret graphs, process camera feeds, or operate with traditionally human interfaces like desktop applications. In some situations, this additional vision modality may process external, untrusted images, and there’s significant precedent about the attack surface of image-processing machine learning systems. In this post, we’ll apply some of these historical ideas to modern architectures to help developers understand the various threats and mitigations unlocked in the vision domain. Vision language models VLMs extend the transformer architecture popularized by large language models (LLMs) to accept both text and image input. VLMs can be finetuned to caption, detect, and segment objects, and…
87dInfra#multimodalby Joseph Lucas
87d ago
Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core
This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training. It dynamically selects the CP size per microbatch to efficiently handle variable-length sequences, achieving up to 1.48x speedup on real-world datasets. In large-scale model training, an often-overlooked bottleneck arises from the sequence-length variability in real-world datasets. Both LLM training and large-scale video generation have clear long-tail distributions in sequence length. A small fraction of ultra-long samples accounts for a disproportionately large share of the computational workload and memory consumption In LLM training, this leads to wide-ranging text sequence lengths across batches. In video generation, high-resolution, multi-second videos can span tens of thousands of tokens. This results in imbalanced sample-level FLOPs and memory usage across data-parallel ranks, modalities, and micro-batches, hindering efficient scheduling and resource utilization. To manage variable-length inputs,…
87dInfra#multimodal#training#gpuby Kunlun Li
[OAI]OpenAI Blog· 25 articlesvisit →
3d ago
Speeding up agentic workflows with WebSockets in the Responses API
Speeding up agentic workflows with WebSockets in the Responses API By Brian Yu and Ashwin Nathan, Members of the Technical Staff When you ask Codex to fix a bug, it scans through your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Under the hood, that means dozens of back-and-forth Responses API requests: determine the model’s next action, run a tool on your computer, send the tool output back to the API, and repeat. All of these requests can add up to minutes that users spend waiting for Codex to complete complex tasks. From a latency perspective, the Codex agent loop spends most of its time in three main stages: working in the API services (to validate and process requests), model inference, and client-side time (running tools and building model…
3dInfra#agents
4d ago
Scaling Codex to enterprises worldwide
Scaling Codex to enterprises worldwide OpenAI is launching Codex Labs and partnering with top GSIs to bring it to thousands of engineering organizations. In early April, we shared that more than 3 million developers were using Codex every week. Just two weeks later, that number has grown to more than 4 million. Beyond individual adoption, we are seeing enterprises moving quickly to roll Codex into real workflows across engineering and beyond. Companies are using Codex across the software development lifecycle. Virgin Atlantic is using it to increase test coverage and increase team velocity - reducing technical debt and improving performance. Ramp is using it to accelerate code review. Notion is using it to quickly build new features. Cisco is using it to understand and reason across large, interconnected repositories. Rakuten is using it for things like incident response. What starts…
4dInfra
9d ago
Codex for (almost) everything
We’re releasing a major update to Codex, making it a more powerful partner for the more than 3 million developers who use it every week to accelerate work across the full software development lifecycle. Codex can now operate your computer alongside you, work with more of the tools and apps you use everyday, generate images, remember your preferences, learn from previous actions, and take on ongoing and repeatable work. The Codex app also now includes deeper support for developer workflows, like reviewing PRs, viewing multiple files & terminals, connecting to remote devboxes via SSH, and an in-app browser to make it faster to iterate on frontend designs, apps, and games. With background computer use, Codex can now use all of the apps on your computer by seeing, clicking, and typing with its own cursor. Multiple agents can work on your…
24d ago
Gradient Labs gives every bank customer an AI account manager
Gradient Labs gives every bank customer an AI account manager Gradient Labs uses GPT‑4.1 and GPT‑5.4 mini and nano to run complex financial support workflows with high accuracy and low latency. Results 10x Revenue growth Results 98% Customer satisfaction with AI agent experience Results +11% Higher accuracy with GPT-4.1 vs. next-best provider In banking, resolving a customer issue is rarely simple. Cases like fraud or blocked payments require strict adherence to complex procedures across multiple teams. When systems fall short, customers are passed between teams, wait in queues, and face delays at moments when the stakes are highest. Gradient Labs(opens in a new window) is built to handle this complexity. The London-based company is building AI agents that give every bank customer the experience of a dedicated account manager. Founded by a team that previously led AI and data efforts…
24dInfra#gpt#agents
25d ago
Accelerating the next phase of AI
OpenAI raises $122 billion to accelerate the next phase of AI Today, we closed our latest funding round with $122 billion in committed capital at a post money valuation of $852 billion. OpenAI is becoming the core infrastructure for AI, making it possible for people around the world and businesses, big and small, to just build things. The broad consumer reach of ChatGPT creates a powerful distribution channel into the workplace, where demand is rapidly shifting from basic model access to intelligent systems that reshape how businesses operate. Developers build on and expand the platform by leveraging our APIs, and Codex is transforming how developers turn ideas into working software. Durable access to compute is the strategic advantage that compounds across the entire system: it advances research, improves products, expands access, and structurally lowers the cost of delivery at scale.…
25dInfra#gpt
39d ago
Introducing GPT-5.4 mini and nano
Today we’re releasing GPT‑5.4 mini and nano, our most capable small models yet. They bring many of the strengths of GPT‑5.4 to faster, more efficient models designed for high-volume workloads. GPT‑5.4 mini significantly improves over GPT‑5 mini across coding, reasoning, multimodal understanding, and tool use, while running more than 2x faster. It also approaches the performance of the larger GPT‑5.4 model on several evaluations, including SWE-Bench Pro and OSWorld-Verified. GPT‑5.4 nano is the smallest, cheapest version of GPT‑5.4 for tasks where speed and cost matter most. It is also a significant upgrade over GPT‑5 nano. We recommend it for classification, data extraction, ranking, and coding subagents that handle simpler supporting tasks. These models are built for the kinds of workloads where latency directly shapes the product experience: coding assistants that need to feel responsive, subagents that quickly complete supporting tasks,…
45d ago
Rakuten fixes issues twice as fast with Codex
Results 50% Reduction in MTTR Results 3-4x Faster potential build time for projects - from quarters to weeks Rakuten(opens in a new window) is a global innovation company operating across e-commerce, fintech, and mobile communications, serving both consumers and merchants at massive scale. With 30,000 employees worldwide, its engineering teams ship across a large, complex product ecosystem where both speed and reliability are essential. That’s why Yusuke Kaji, General Manager of AI for Business at Rakuten, has spent the past year pushing agentic workflows deeper into how teams plan, build, and validate software. Codex—the coding agent from OpenAI—has become a core part of Rakuten’s engineering stack, especially where the company needs to move faster without compromising security. Over the past year, Rakuten engineers have used Codex across operations and software delivery to compress incident response (including a ~50% reduction in…
45d ago
From model to agent: Equipping the Responses API with a computer environment
From model to agent: Equipping the Responses API with a computer environment By Bo Xu, Danny Zhang, and Rohit Arunachalam We're currently in a shift from using models, which excel at particular tasks, to using agents capable of handling complex workflows. By prompting models, you can only access trained intelligence. However, giving the model a computer environment can achieve a much wider range of use cases, like running services, requesting data from APIs, or generating more useful artifacts like spreadsheets or reports. A few practical problems emerge when you try to build agents: where to put intermediate files, how to avoid pasting large tables into a prompt, how to give the workflow network access without creating a security headache, and how to handle timeouts and retries without building a workflow system yourself. Instead of putting it on developers to build…
45dInfra#agents
46d ago
Improving instruction hierarchy in frontier LLMs
Improving instruction hierarchy in frontier LLMs Introducing IH-Challenge, a training dataset that strengthens instruction hierarchy, safety steerability, and prompt injection robustness. AI systems often receive instructions from multiple sources. These can include safety policies from system messages, product guidance from developers, requests from users, and information found online. Training models to reliably prioritize the most trusted instructions among these sources is a key part of safe deployment. Many AI safety and reliability issues can arise when this prioritization breaks down. Models may receive requests for disallowed content, attempts to reveal private information, or prompt‑injection attacks embedded in online data. Failing to behave appropriately in each of these scenarios shares the same root cause: the model may follow the wrong instruction. When these instructions conflict, the model has to decide which ones to prioritize. If it treats an untrusted instruction as…
51d ago
VfL Wolfsburg turns ChatGPT into a club-wide capability
VfL Wolfsburg turns ChatGPT into a club-wide capability By focusing on people, not pilots, the Bundesliga club is scaling efficiency, creativity, and knowledge—without losing its football identity. Results 50+ Custom GPTs in active daily use Results 1M+ Annual cost savings through reduced reliance on external agencies At VfL Wolfsburg, football is built on discipline, continuity, and trust. For nearly three decades, the club has been a constant presence in the Bundesliga—backed by strong men’s and women’s teams, a future-oriented academy, and a fast-evolving digital and commercial ecosystem. But modern football is no longer defined by performance on the pitch alone. Expectations from fans, partners, and internal stakeholders continue to rise—while budgets and headcount cannot scale indefinitely. This tension between growing expectations and limited scalability created a clear need for new ways of working. The question was how to apply it…
51dInfra#gpt
51d ago
Introducing GPT-5.4
Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking), the API, and Codex. It’s our most capable and efficient frontier model for professional work. We’re also releasing GPT‑5.4 Pro in ChatGPT and the API, for people who want maximum performance on complex tasks. GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. The result is a model that gets complex real work done accurately, effectively, and efficiently—delivering what you asked for with less back and forth. In ChatGPT, GPT‑5.4 Thinking can now provide an upfront plan of its thinking, so you can adjust course mid-response while it’s working, and arrive at a final…
51dInfra#coding
57d ago
Scaling AI for everyone
Scaling AI for everyone AI demand is surging across consumers, developers, and businesses. Meeting that demand and providing everyone access to our products requires three things: compute, distribution, and capital. Today we’re announcing $110B in new investment at a $730B pre-money valuation. This includes $30B from SoftBank, $30B from NVIDIA, and $50B from Amazon. We’ve also signed a strategic partnership with Amazon and secured next generation inference compute with NVIDIA. Additional financial investors are expected to join as the round progresses. These partnerships expand our global reach, deepen our infrastructure, and strengthen our balance sheet so we can bring frontier AI to more people, more businesses, and more communities worldwide. You can see that scale in our products. Codex brings the power of a top engineer to anyone who wants to build software. Weekly Codex users have more than tripled…
57dInfra#gpu
57d ago
OpenAI and Amazon announce strategic partnership
OpenAI and Amazon announce strategic partnership News: - Amazon Web Services (AWS) and OpenAI will co-create a Stateful Runtime Environment powered by OpenAI models, available on Amazon Bedrock for AWS customers to build generative AI applications and agents at production scale. - AWS will be the exclusive third-party cloud distribution provider for OpenAI Frontier, which enables organizations to build, deploy, and manage teams of AI agents. - OpenAI to consume 2 gigawatts of Trainium capacity through AWS infrastructure to support demand for Stateful Runtime Environment, Frontier, and other advanced workloads. - OpenAI and Amazon will develop customized models available to power Amazon’s customer-facing applications. - Amazon will invest $50 billion in OpenAI. OpenAI and Amazon (NASDAQ: AMZN) today announced a multi-year strategic partnership to accelerate AI innovation for enterprises, startups, and end consumers around the world. Amazon will also invest…
57dInfra
59d ago
Improving India’s critical care infrastructure
10BedICU 10BedICU uses OpenAI’s API to improve India’s critical care infrastructure. India faces a significant challenge in healthcare accessibility due to a high doctor-to-patient ratio, geographic barriers, and economic constraints. For instance, the ratio of oncologists to cancer patients in India is approximately 1:2,000(opens in a new window), a stark contrast to the United States’ 1:100. 10BedICU was founded as an initiative of the eGov Foundation(opens in a new window) to address these disparities. 10BedICU aims to elevate India’s critical care infrastructure, widening access to quality healthcare to India’s most underserved communities. 10BedICU is now using OpenAI models to meet the high‑stakes demands of critical‑care workflows and let clinicians reach more patients. Founder Srikanth Nadhamuni got the idea for 10BedICU during the devastating 2021 Delta wave of COVID-19, which saw over 20 million cases in just a few months. With…
59dInfra
59d ago
Stargate Infrastructure
Stargate Infrastructure OpenAI, and our strategic partners, are thrilled about our shared vision for new AI infrastructure in the United States. We are energized by the challenges we face and are excited by the prospect of partnering with firms across the industrial base to deliver against our ambitious mission. Specifically, we want to connect with firms across the built data center infrastructure landscape, from power and land to construction to equipment, and everything in between.
59dInfra#multimodal
61d ago
OpenAI announces Frontier Alliance Partners
Introducing Frontier Alliances The limiting factor for seeing value from AI in enterprises isn’t model intelligence, it’s how agents are built and run in their organizations. We recently introduced Frontier, our platform for building, deploying, and managing AI coworkers that can do real work across the enterprise. For example, an AI coworker that resolves a customer issue end-to-end by pulling context from the CRM, checking policies, filing the update, and escalating only when needed. Frontier provides the technical foundation. But making real impact with AI also requires leadership alignment, workflow redesign, integration across systems and data, as well as the kind of change management that drives adoption. Today, we’re announcing our Frontier Alliances. Boston Consulting Group (BCG)(opens in a new window) and McKinsey & Company(opens in a new window) as well as Accenture(opens in a new window) and Capgemini(opens in…
61dInfra#agents
66d ago
Introducing OpenAI for India
Introducing OpenAI for India Today at the India AI Impact Summit 2026 in Delhi, we’re launching OpenAI for India, a nationwide initiative with leading Indian partners to expand access to AI and unlock its economic and societal benefits in the world’s largest democracy. As of this month, India is home to more than 100 million weekly ChatGPT users, from students and teachers to developers and entrepreneurs. OpenAI for India builds on that momentum, working with leading partners—beginning with Tata Group—to build sovereign AI capabilities, accelerate enterprise adoption, invest in workforce upskilling, and strengthen India’s thriving AI ecosystem. As part of our global Stargate initiative, OpenAI and Tata Group are partnering to develop local, AI-ready data center capacity designed for data residency, security, and long-term domestic capability. OpenAI will become the first customer of Tata Consultancy Services’ HyperVault data center business,…
66dInfra#local
71d ago
Beyond rate limits: scaling access to Codex and Sora
Beyond rate limits: scaling access to Codex and Sora By Jonah Cohen, Member of the Technical Staff In the past year, both Codex and Sora have seen rapid adoption, with usage quickly pushing beyond what we originally expected. We’ve seen a consistent pattern: users dive in, find real value, and then run into rate limits. Rate limits can help smooth demand and ensure fair access; however, when users are getting value, hitting a hard stop can be frustrating. We wanted a way for users to keep going, while protecting system performance and user trust in our approach. To solve this, we built a real‑time access engine that counts usage. One of the layers in that engine is the ability to purchase credits. When users exceed their rate limits, credits let them keep using our products by spending down their credit…
71dInfra
75d ago
Bringing ChatGPT to GenAI.mil
Bringing ChatGPT to GenAI.mil Today, OpenAI for Government is announcing the next phase of our national security work: bringing ChatGPT to GenAI.mil, the Department of War’s secure enterprise AI platform used by 3 million civilian and military personnel. By joining the other frontier AI labs on GenAI.mil, we are building on our existing work with the Pentagon—including our collaboration with DARPA(opens in a new window) to help cyber defenders and the pilot program we announced earlier this year with the Department’s Chief Digital and Artificial Intelligence Office (CDAO) focused on how frontier AI can transform the Pentagon’s operations. We believe the people responsible for defending the country should have access to the best tools available, and it is important for the United States and other democratic countries to understand how, with the proper safeguards, AI can help protect people, deter…
75dInfra#gpt#safety
86d ago
Taisei Corporation shapes the next generation of talent with AI
Taisei Corporation shapes the next generation of talent with AI Taisei Corporation’s HR team is leading the rollout of ChatGPT Enterprise to drive AI-powered talent development across the organization. Results 3,300 Custom GPTs created Results 90% Weekly active usage of ChatGPT Enterprise Results 5.5 hrs+ Time saved per employee each week Founded in 1917, Taisei Corporation is one of Japan’s leading construction companies. For more than a century, it has delivered projects in Japan and around the world, helping to build the social infrastructure that supports modern life. Recently, a new question has come into focus: What should Taisei build next? The company began to ask whether its most important investment should be not only in buildings and infrastructure, but in people. With this in mind, Taisei’s HR organization decided to introduce ChatGPT Enterprise as a cornerstone of its talent…
86dInfra#gpt
93d ago
Scaling PostgreSQL to power 800 million ChatGPT users
Scaling PostgreSQL to power 800 million ChatGPT users By Bohan Zhang, Member of the Technical Staff For years, PostgreSQL has been one of the most critical, under-the-hood data systems powering core products like ChatGPT and OpenAI’s API. As our user base grows rapidly, the demands on our databases have increased exponentially, too. Over the past year, our PostgreSQL load has grown by more than 10x, and it continues to rise quickly. Our efforts to advance our production infrastructure to sustain this growth revealed a new insight: PostgreSQL can be scaled to reliably support much larger read-heavy workloads than many previously thought possible. The system (initially created by a team of scientists at University of California, Berkeley) has enabled us to support massive global traffic with a single primary Azure PostgreSQL flexible server instance(opens in a new window) and nearly 50…
93dInfra#gpt
95d ago
Stargate Community
Stargate Community OpenAI’s mission is to ensure that AGI benefits all of humanity, and in order to do that, we are working to ensure our Stargate campuses benefit the local communities that make them possible. We believe that AI infrastructure(opens in a new window) is vital for American competitiveness and economic opportunity, while boosting local economies by creating jobs and bringing in local revenue. When we announced Stargate one year ago in January 2025, we set out to expand our U.S. AI infrastructure to 10GW by 2029—and just one year in, we are already well beyond halfway to that goal in planned capacity, with the first site in Abilene, Texas already training and serving frontier AI systems and multiple Stargate sites under development across Texas, New Mexico, Wisconsin, and Michigan. We are committed to working with communities to ensure that…
95dInfra
95d ago
Horizon 1000: Advancing AI for primary healthcare
Horizon 1000: Advancing AI for primary healthcare Together, with the Gates Foundation, we’re committing $50 million in funding and technology to help strengthen primary healthcare for 1,000 African clinics and their communities. Editor’s Note: On behalf of The Gates Foundation, Bill Gates also shared this news on Gates Notes(opens in a new window). AI capabilities have advanced much faster than their broad, real-world deployment, leaving a growing gap between what’s possible and what people experience. These systems have become so capable that they’ve made new kinds of things possible—some we couldn’t have imagined not long ago, and some we’re still discovering. This is especially clear in healthcare, where the challenge is now turning powerful models into tools that work in everyday care. Today, we’re announcing Horizon 1000, a pilot initiative with the Gates Foundation to support leaders in African countries,…
95dInfra
97d ago
A business that scales with the value of intelligence
We launched ChatGPT as a research preview to understand what would happen if we put frontier intelligence directly in people’s hands. What followed was broad adoption and deep usage on a scale that no one predicted. More than experimenting with AI, people folded ChatGPT into their lives. Students started using it to untangle homework they were stuck on late at night. Parents started using it to plan trips and manage budgets. Writers used it to break through blank pages. More and more, people used it to understand their lives. People used ChatGPT to help make sense of health symptoms, prepare for doctor visits, and navigate complex decisions. People used it to think more clearly when they were tired, stressed, or unsure. Then they brought that leverage to work. At first, it showed up in small ways. A draft refined before…
97dInfra#gpt
100d ago
Strengthening the U.S. AI supply chain through domestic manufacturing
Strengthening the US AI supply chain through domestic manufacturing New Request for Proposals to help build and scale the infrastructure behind advanced AI. Building the infrastructure required to power advanced AI presents a historic opportunity to strengthen domestic supply chains and reindustrialize the country(opens in a new window). If we seize it, we can catalyze U.S. manufacturing, modernize our energy grid, create well-paid jobs, and strengthen American leadership. Infrastructure has long been destiny when it comes to America’s economic success, and that will be especially true in the Intelligence Age. At OpenAI, we’re committed to doing our part. Since launching our Stargate initiative almost one year ago, we’ve announced planned capacity that puts us well over halfway to meeting our 10-gigawatt commitment. These investments are already translating into good jobs and local economic growth in communities across the country. Over…
100dInfra
[PB]PyTorch Blog· 3 articlesvisit →
8d ago
Optimizing Effective Training Time for Meta’s Internal Recommendation/Ranking Workloads
Motivation and Introduction Across the industry, teams training and serving large AI models face aggressive ROI targets under tight compute capacity. As workloads scale, improving infrastructure effectiveness gets harder because end-to-end runtime increasingly includes overheads beyond “real training” (initialization, orchestration, checkpointing, retries, failures, and recovery). Meta utilizes Effective Training Time (ETT%) to quantify efficiency, defining it as the percentage of total end-to-end (E2E) wall time dedicated to productive training. This metric directly points to areas where time is wasted, thus facilitating the prioritization of efficiency improvements. In this work stream, while grounded in Meta’s production experience using PyTorch for model training, we aim to share broadly useful lessons: some improvements have been implemented in open source—e.g., TorchRec sharding plan improvements and PyTorch 2 (PT2) compilation optimizations that reduce compile time and recompilation—while others (like checkpointing and model publishing) are more…
8dInfra#inference#trainingby Ruilin Chen, Yuzhen Huang, Hang Qi, Mingming Ding, Damian Reeves, Boris Sarana, Kevin Tang, Satendra Gera, Gagan Jain, Sahil Shah, Oguz Ulgen, Mayank Garg, Meet Vadakkanchery, James March, Sophie Lin, Wei Sun
17d ago
Monarch: an API to your supercomputer
Getting distributed training jobs to run on huge clusters is hard! This is especially true when you start looking at more complex setups like distributed reinforcement learning. Debugging these kinds of jobs is frustrating, and the turnaround time for changes tends to be very slow. Monarch is a distributed programming framework for PyTorch that makes the cluster programmable through a simple Python API. It exposes the supercomputer as a coherent, directly controllable system—bringing the experience of local development to large-scale training, as if your laptop had 1000s of GPUs attached. A complete training system can be defined in a single Python program. Core primitives are explicit and minimal, enabling higher-level capabilities—fault tolerance, orchestration, tooling integration—to be built as reusable libraries. Monarch is optimized for agentic usage, providing consistent infrastructure abstractions and exposing telemetry via standard SQL-based APIs that agents already…
17dInfra#trainingby The PyTorch Team at Meta
33d ago
PyTorch 2.11 Release Blog
We are excited to announce the release of PyTorch® 2.11 (release notes)! The PyTorch 2.11 release features the following changes: - Differentiable Collectives for Distributed Training - FlexAttention now has a FlashAttention-4 backend on Hopper and Blackwell GPUs. - MPS (Apple Silicon) Comprehensive Operator Expansion - RNN/LSTM GPU Export Support - XPU Graph This release is composed of 2723 commits from 432 contributors since PyTorch 2.10. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.11. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page. On Tuesday, March 31st at 10 am, Andrey Talman and Nikita Shulga will host a live session to walk through what’s new in 2.11, including Differentiable Collectives…
33dInfra#trainingby PyTorch Foundation
[RB]Replicate Blog· 1 articlesvisit →
66d ago
Recraft V4: image generation with design taste
Recraft V4: image generation with design taste Recraft V4 is Recraft’s latest image generation model, rebuilt from the ground up. The big idea behind it is what the Recraft team calls “design taste” — the model makes visual decisions about composition, lighting, and color that feel intentional rather than generic. Images come out looking art-directed, even from simple prompts. V4 comes in four versions — two raster, two vector: All four share the same design taste and prompt accuracy. The differences are output format, resolution, and speed. Some examples These prompts are designed to push V4 into territory where most image models fall flat — complex typography layouts, precise material rendering, extreme detail at macro scale, structured vector assets, and stylized illustration with character. Typography and editorial design V4 treats text as a first-class element of composition. This prompt asks…
66dInfra#multimodal
[SWB]Simon Willison Blog· 1 articlesvisit →
2d ago
A pelican for GPT-5.5 via the semi-official Codex backdoor API
A pelican for GPT-5.5 via the semi-official Codex backdoor API 23rd April 2026 GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers. I’ve had some preview access and found it to be a fast, effective and highly capable model. As is usually the case these days, it’s hard to put into words what’s good about it—I ask it to build things and it builds exactly what I ask for! There’s one notable omission from today’s release—the API: API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale. We’ll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon. When I run my pelican benchmark I always prefer to use an API, to avoid hidden system prompts in ChatGPT…
2dInfra#gpt
[TVA]The Verge AI· 2 articlesvisit →
3d ago
Now Meta will track what employees do on their computers to train its AI agents
Meta employees’ activity at work is now being used to train the company’s AI agents. As reported by Reuters, Meta is installing a tool it calls Model Capability Initiative (MCI) on US-based employees’ computers that runs in work-related apps and websites, recording mouse movements, clicks, keystrokes, and occasional screenshots. Now Meta will track what employees do on their computers to train its AI agents The ‘Model Capability Initiative’ records mouse activity, keystrokes, and screenshots to use as AI training data. The ‘Model Capability Initiative’ records mouse activity, keystrokes, and screenshots to use as AI training data. The data from this tool will be used to train the company’s AI models to get better at interacting with computers the way humans do, including automating work tasks like those Meta’s employees perform on the job. According to Reuters, the data from MCI…
3dInfraby Stevie Bonifield
3d ago
Anthropic’s Mythos rollout has missed America’s cybersecurity agency
Several US federal agencies are taking up Anthropic’s new cybersecurity model to find vulnerabilities, but one is reportedly not getting in on the action: the nation’s central cybersecurity coordinator. Anthropic’s Mythos rollout has missed America’s cybersecurity agency CISA, embattled under the Trump administration, reportedly hasn’t gotten Anthropic’s powerful AI. CISA, embattled under the Trump administration, reportedly hasn’t gotten Anthropic’s powerful AI. On Tuesday, Axios reported that the Cybersecurity and Infrastructure Security Agency (CISA) didn’t have access to Mythos Preview, which Anthropic has touted as a powerful tool for finding and patching security vulnerabilities. Meanwhile, other agencies like Commerce Department and National Security Agency (NSA) are reportedly using the model, and President Donald Trump’s administration has been negotiating broader access, Axios wrote last week. In a blog post, Anthropic said it’s “been in ongoing discussions with US government officials about Claude…
3dInfraby Lauren Feiner
[VB]vLLM Blog· 1 articlesvisit →
11d ago
vLLM Korea Meetup 2026 Wrap-Up Apr 14, 2026 · 7 min read Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd.
vLLM Korea Meetup 2026 Wrap-Up Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd. This meetup proved to be much more than a standard tech event. Not only did it see strong turnout on the day, but the post-event survey recorded an impressive ~75% response rate — a testament to the active engagement of the attendees. Results reflected high overall satisfaction, confirming that the meetup delivered both in-depth practical content and a genuine community experience. Field engineers from a wide range of companies and research institutions gathered to share real-world deployment stories and infrastructure strategies for running LLMs in production. As AI moves beyond the research phase and into full-scale services, handling inference workloads efficiently has become a central challenge.…
11dInfra#inference
[WA]Wired AI· 2 articlesvisit →
3d ago
AI Tools Are Helping Mediocre North Korean Hackers Steal Millions
The advent of AI hacking tools has raised fears of a near future in which anyone can use automated tools to dig up exploitable vulnerabilities in any piece of software, like a kind of digital intrusion superpower. Here in the present, however, AI seems to be playing a more mundane, if still concerning, role in hackers’ toolkit: It’s helping mediocre hackers level up and carry out broad, effective malware campaigns. That includes one group of relatively unskilled North Korean cybercriminals who’ve been discovered using AI to carry out virtually every part of an operation that hacked thousands of victims to steal their cryptocurrency. On Wednesday, cybersecurity firm Expel revealed what it describes as a North Korean state-sponsored cybercrime operation that installed credential-stealing malware on more than 2,000 computers, specifically targeting the machines of developers working on small cryptocurrency launches, NFT…
3dInfra#codingby Andy Greenberg, Matt Burgess
3d ago
5 AI Models Tried to Scam Me. Some of Them Were Scary Good
I recently witnessed how scary-good artificial intelligence is getting at the human side of computer hacking, when the following message popped up on my laptop screen: Hi Will, I’ve been following your AI Lab newsletter and really appreciate your insights on open-source AI and agent-based learning—especially your recent piece on emergent behaviors in multi-agent systems. I’m working on a collaborative project inspired by OpenClaw, focusing on decentralized learning for robotics applications. We’re looking for early testers to provide feedback, and your perspective would be invaluable. The setup is lightweight—just a Telegram bot for coordination—but I’d love to share details if you’re open to it. The message was designed to catch my attention by mentioning several things I am very into: decentralized machine learning, robotics, and the creature of chaos that is OpenClaw. Over several emails, the correspondent explained that his…
3dInfra#agents#open-sourceby Will Knight