$ timeahead_
★ TOP STORY · [NV] · Research · 2d ago

Winning a Kaggle Competition with Generative AI–Assisted Coding

In March 2026, three LLM agents generated over 600,000 lines of code, ran 850 experiments, and helped secure a first-place finish in a Kaggle playground competition. Success in modern machine learning competitions is increasingly defined by how quickly you can generate, test, and iterate on ideas. LLM agents, combined with GPU acceleration, dramatically compress this loop. Historically, two bottlenecks have limited this experimentation:

- How quickly you can write code for new experiments.
- How quickly you can execute those experiments.

GPUs and libraries like NVIDIA cuDF, NVIDIA cuML, XGBoost, and PyTorch have largely solved the second problem. LLM agents now address the first problem—unlocking a new scale of rapid, iterative experimentation. This blog post describes how I used LLM agents to accelerate the discovery of the most performant tabular data prediction solutions. Case study: Kaggle Playground churn prediction The…

NVIDIA Developer Blog
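A minimal sketch of the GPU-side experiment loop the story describes, assuming a hypothetical train.csv with a binary "churn" target (file name and column are illustrative, not from the competition itself):

```python
# Hedged sketch: one iteration of a GPU-accelerated tabular experiment loop.
# Assumes a hypothetical train.csv with a binary "churn" column.
import cudf
import xgboost as xgb
from cuml.model_selection import train_test_split

df = cudf.read_csv("train.csv")                # GPU-resident DataFrame (cuDF)
X, y = df.drop(columns=["churn"]), df["churn"]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# XGBoost 2.x runs "hist" tree training directly on the GPU via device="cuda".
model = xgb.XGBClassifier(device="cuda", tree_method="hist", n_estimators=500)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
print(model.predict_proba(X_va)[:5])           # validation probabilities
```

Each agent-generated idea becomes one such script; the GPU keeps the execute step as fast as the write step.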
3d ago
Simplify Sparse Deep Learning with Universal Sparse Tensor in nvmath-python
In a previous post, we introduced the Universal Sparse Tensor (UST), enabling developers to decouple a tensor’s sparsity from its memory layout for greater flexibility and performance. We’re excited to announce the integration of the UST into nvmath-python v0.9.0 to accelerate sparse scientific and deep learning applications. This post provides a walkthrough of key UST features, implementation details, and a performance overview, including:

- Zero-cost interoperability: Data-movement-free conversion with PyTorch, SciPy, and CuPy.
- Custom formats: Define novel sparsity schemes.
- Polymorphic operations: Sparsity-agnostic functions automatically use optimized kernels or generate custom sparse code—eliminating the need for manual coding of new formats.
- PyTorch injection: Easily inject UST performance benefits into existing PyTorch models.
- Transparent caching: Avoid JIT/LTO recompilation and replanning—amortizing overhead over subsequent repeated execution of the same operation.

Tensor format DSL The UST describes common (e.g., COO, CSR,…
3d · Tutorial · #coding · by Aart J.C. Bik
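For readers new to the formats the UST's tensor format DSL covers, here is a generic COO/CSR illustration in plain SciPy; note this is not the nvmath-python UST API, just the two memory layouts it decouples from the logical tensor:

```python
# Generic sketch of COO vs. CSR: same logical sparse tensor, two layouts.
# This uses SciPy for illustration -- NOT the nvmath-python UST API.
import numpy as np
from scipy.sparse import coo_matrix

# COO: explicit (row, col, value) triplets.
rows = np.array([0, 1, 2])
cols = np.array([2, 0, 1])
vals = np.array([3.0, 4.0, 5.0])
A_coo = coo_matrix((vals, (rows, cols)), shape=(3, 3))

# CSR: row pointers + column indices -- a different memory layout.
A_csr = A_coo.tocsr()
print(A_csr.indptr, A_csr.indices, A_csr.data)
```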
3d ago
Scaling the AI-Ready Data Center with NVIDIA RTX PRO 4500 Blackwell Server Edition and NVIDIA vGPU 20
AI integration is redefining mainstream enterprise applications, from productivity software like Microsoft Office to more complex design and engineering tools. This shift requires the modern data center to move beyond single-purpose silos. For developers, gaining access to dedicated GPU compute can often be a bottleneck. Virtual machines (VMs) solve part of this challenge by providing secure, isolated, and scalable environments tailored to specific project needs. However, dedicating an entire physical GPU to a single VM is highly inefficient for mixed or lightweight workloads. This is where NVIDIA Multi-Instance GPU (MIG) technology becomes essential. With MIG, a single physical GPU is partitioned at the hardware level into multiple fully independent instances, each with guaranteed memory, cache, and compute cores. For a development team, this ensures predictable, uncompromising Quality of Service (QoS). This means that multiple developers can simultaneously train AI models,…
3d · Hardware · #gpu · by Phoebe Lee
3d ago
Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron
Higher-order optimization algorithms such as Shampoo have been effectively applied in neural network training for at least a decade. These methods have achieved significant success more recently when applied to leading LLMs. In particular, Muon (MomentUm Orthogonalized by Newton-Schulz) was used to train some of today’s best open source models, including Kimi K2 and GLM-5. This post explains how NVIDIA provides comprehensive support for Muon and other cutting-edge emerging optimizers and the technologies enabling them to train large-scale models. Muon training performance on NVIDIA GB300 NVL72 Table 1 summarizes training throughput of the Kimi K2 and Qwen3 30B models with Muon and the AdamW optimizer on the NVIDIA GB300 NVL72 system. With the technologies that will be introduced in the next section, the results show that there is a very small training performance loss using the Muon optimizer compared to…
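For context, a minimal sketch of the Newton-Schulz orthogonalization at Muon's core, following the coefficients of Keller Jordan's public reference implementation; the Megatron integration described above layers distributed execution on top of this:

```python
# Minimal Newton-Schulz orthogonalization as used by Muon (coefficients from
# the public reference implementation); distributed machinery omitted.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)                # scale so singular values <= 1
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # push singular values toward 1
    return X.T if transpose else X

# Muon applies this to the momentum buffer of each 2D weight matrix.
update = newton_schulz_orthogonalize(torch.randn(1024, 4096))
```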
5d ago
Maximizing Memory Efficiency to Run Bigger Models on NVIDIA Jetson
The boom in open source generative AI models is pushing beyond data centers into machines operating in the physical world. Developers are eager to deploy these models at the edge, enabling physical AI agents and autonomous robots to automate heavy-duty tasks. A key challenge is efficiently running multi-billion-parameter models on edge devices with limited memory. With ongoing constraints on memory supply and rising costs, developers are focused on achieving more with less. The NVIDIA Jetson platform supports popular open models while delivering strong runtime performance and memory optimization at the edge. For edge developers, the memory footprint determines whether a system functions. Unlike cloud environments, edge devices operate under strict memory limits, with CPU and GPU sharing constrained resources. Inefficient memory use can lead to bottlenecks, latency spikes, or system failure. Meanwhile, modern edge applications often run multiple pipelines—such as…
5d · Open Source · #coding #open-source #gpu · by Anshuman Bhat
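A back-of-envelope sketch of the budgeting exercise this implies: weight memory at different quantization levels against a Jetson's shared CPU/GPU budget (the 8B model size is an illustrative assumption, not from the post):

```python
# Illustrative weight-memory arithmetic for an edge deployment.
def model_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"8B model @ {bits}-bit weights: {model_footprint_gb(8, bits):.1f} GB")
# 16.0 GB -> 8.0 GB -> 4.0 GB: quantization decides whether the model fits
# alongside the rest of the pipeline in a shared CPU/GPU memory budget.
```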
5d ago
Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy Optimization (GRPO) power this transition, enabling reasoning-grade models to continuously improve through iterative feedback. Unlike standard supervised fine-tuning, RL training loops are bifurcated into two distinct, high-intensity phases: a generation phase with a stringent latency requirement and a training phase requiring high throughput. To make these workloads viable, researchers and engineers are turning to low-precision datatypes like FP8 to boost performance in training and throughput-oriented generation. Moreover, in some scenarios where generation is bound by GPU memory bandwidth, using low-precision parameters can improve performance due to fewer bytes per parameter. This post dives deep into the systemic challenges of low-precision RL and how NVIDIA NeMo RL—an open source library within the NVIDIA NeMo framework—speeds up RL workloads while…
5d · Infra · #inference #training · by Guyue Huang
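A tiny sketch of the bytes-per-parameter point, assuming PyTorch 2.1+ with FP8 support; real FP8 RL pipelines such as NeMo RL also need scaling factors and FP8-aware kernels, so this shows only the storage effect:

```python
# FP8 (e4m3) halves bytes per parameter versus BF16 -- the storage side of
# the bandwidth argument above. Scaling factors/kernels omitted.
import torch

w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")
w_fp8 = w_bf16.to(torch.float8_e4m3fn)

print(w_bf16.element_size(), "bytes/param ->", w_fp8.element_size(), "byte/param")
# Bandwidth-bound generation reads half as many bytes per parameter.
```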
5d ago
Mitigating Indirect AGENTS.md Injection Attacks in Agentic Environments
AI tools are significantly accelerating software development and changing how developers work with code. These tools serve as real-time copilots, automating repetitive tasks, executing tasks, writing documentation, and more. OpenAI Codex, for example, is a coding agent designed to assist developers through tasks like code generation, debugging, and automated pull request (PR) creation. Yet as agentic tools are integrated into workflows, how they affect the safety, reliability, and integrity of software development must be considered. A recent Codex vulnerability discovered by the NVIDIA AI Red Team highlights security gaps from indirect AGENTS.md injection through malicious dependencies. While this attack relies on a compromised dependency, meaning the attacker already has a form of code execution, it illustrates a new dimension of supply chain risk unique to agentic development environments. This post walks through the attack chain step-by-step—from dependency setup to instruction…
5d · Agents · #agents #coding · by Daniel Teixeira
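A defensive sketch, not taken from the post: scanning a project's dependency tree for AGENTS.md files shipped by third-party packages, the injection vector described above (the directory names are common conventions, not an exhaustive list):

```python
# Hedged mitigation sketch: flag AGENTS.md files that arrive via dependencies
# rather than your own repo, so a human reviews them before an agent reads them.
from pathlib import Path

DEPENDENCY_DIRS = {"node_modules", "site-packages", "vendor"}  # illustrative

def find_dependency_agent_files(root: str) -> list[Path]:
    hits = []
    for p in Path(root).rglob("AGENTS.md"):
        if DEPENDENCY_DIRS.intersection(p.parts):
            hits.append(p)          # instruction file shipped by a dependency
    return hits

for hit in find_dependency_agent_files("."):
    print("review before agent use:", hit)
```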
8d ago
Full-Stack Optimizations for Agentic Inference with NVIDIA Dynamo
Coding agents are starting to write production code at scale. Stripe’s agents generate 1,300+ PRs per week. Ramp attributes 30% of merged PRs to agents. Spotify reports 650+ agent-generated PRs per month. Tools like Claude Code and Codex make hundreds of API calls per coding session, each carrying the full conversation history. Behind every one of these workflows is an inference stack under significant KV cache pressure. Let’s take Claude Code as an example. After the first API call writes the conversation prefix to KV cache, every subsequent call to the same worker sees an 85-97% cache hit rate. Agent teams (or swarms) push this further with a 97.2% aggregate cache hit rate across 4 Opus teammates. An 11.7x read/write ratio means the system reads from cache nearly 12 times for every token it writes. This is a write-once-read-many (WORM) access pattern: the…
8d · Agents · #agents #inference #coding #gpu · by Ishan Dhanani
8d ago
Build a More Secure, Always-On Local AI Agent with OpenClaw and NVIDIA NemoClaw
Agents are evolving from question-and-answer systems into long-running autonomous assistants that read files, call APIs, and drive multi-step workflows. However, deploying an agent to execute code and use tools without proper isolation raises real risks—especially on third-party cloud infrastructure, where data privacy and control are at stake. NVIDIA NemoClaw is an open-source reference stack that orchestrates NVIDIA OpenShell to run OpenClaw, a self-hosted gateway that connects messaging platforms to AI coding agents powered by open models like NVIDIA Nemotron. NemoClaw adds guided onboarding, lifecycle management, image hardening, and a versioned blueprint, providing a complete pipeline from model inference to more secure, interactive agent deployment. This tutorial walks through a NemoClaw deployment on NVIDIA DGX Spark—from configuring the runtime environment and serving the model locally, to installing the NemoClaw stack and connecting it to Telegram for remote access. You’ll build a local,…
8d · Agents · #agents #local #gpu · by Patrick Moorhead
8d ago
Accelerate Clean, Modular, Nuclear Reactor Design with AI Physics
The development of socially acceptable nuclear reactors requires that they are safe, clean, efficient, economical, and sustainable. Meeting these requirements calls for new approaches, driving growing interest in Small Modular Reactors (SMRs) and in Generation IV designs. SMRs aim to improve project economics by standardising designs and shifting construction to controlled manufacturing environments, while Gen IV reactors target fundamental fuel-cycle challenges by better managing transuranics and reducing the radiotoxicity and longevity of waste. Together, these approaches offer a credible roadmap toward safer, cleaner, and more sustainable nuclear energy. However, validating new designs presents significant challenges. Due to the expense, time constraints, and inherent complexities of physical experiments, numerical simulations are fundamental to the design of nuclear reactors. Yet, the high computational cost of these simulations often creates a major bottleneck in the design process, slowing the pace of innovation. To…
8d · Research · by Mark Hobbs
9d ago
How to Build Vision AI Pipelines Using NVIDIA DeepStream Coding Agents
Developing real-time vision AI applications presents a significant challenge for developers, often demanding intricate data pipelines, countless lines of code, and lengthy development cycles. NVIDIA DeepStream 9 removes these development barriers using coding agents, such as Claude Code or Cursor, to help you easily create deployable, optimized code that brings your vision AI applications to life faster. This new approach simplifies the process of building complex multi-camera pipelines that ingest, process, and analyze massive volumes of real-time video, audio, and sensor data. Built on GStreamer and part of the NVIDIA Metropolis vision AI development platform, DeepStream accelerates a developer’s journey from concept to actionable insight across industries. Video 1. How to use the NVIDIA DeepStream coding agents to generate complete vision AI pipelines from natural language prompts with Claude Code. To watch a recording showing how to build a DeepStream…
9d · Tutorial · #multimodal #coding #gpu · by Debraj Sinha
11d ago
Building Custom Atomistic Simulation Workflows for Chemistry and Materials Science with NVIDIA ALCHEMI Toolkit
For decades, computational chemistry has faced a tug-of-war between accuracy and speed. Ab initio methods like density functional theory (DFT) provide high fidelity but are computationally expensive, limiting researchers to systems of a few hundred atoms. Conversely, classical force fields are fast but often lack the chemical accuracy required for complex bond-breaking or transition-state analysis. Machine learning interatomic potentials (MLIPs) have emerged as the bridge, offering quantum accuracy at classical speeds. However, the software ecosystem is a new bottleneck. While the MLIP models themselves run on GPUs, the surrounding simulation infrastructure often relies on legacy CPU-centric code. NVIDIA ALCHEMI (AI Lab for Chemistry and Materials Innovation) helps to address these challenges by accelerating chemicals and materials discovery with AI. We have previously announced two components of the ALCHEMI portfolio:

- ALCHEMI NIM microservices: Scalable, cloud‑ready microservices for AI-accelerated batched atomistic…
11d · Research · #agents #gpu · by Erica Tsai
11d ago
NVIDIA NVbandwidth: Your Essential Tool for Measuring GPU Interconnect and Memory Performance
When you’re writing CUDA applications, one of the most important things to focus on is data transfer performance. This applies to single-GPU and multi-GPU systems alike. One of the tools you can use to understand the memory characteristics of your GPU system is NVIDIA NVbandwidth. In this blog post, we’ll explore what NVbandwidth is, how it works, its key features, and how you can use it to test and evaluate your own NVIDIA GPU systems. This post is intended for CUDA developers, system architects, and ML infrastructure engineers who need to measure and validate GPU interconnect performance. What is NVbandwidth? NVbandwidth is a CUDA-based tool that measures bandwidth and latency for various memory copy patterns across different links using either copy engine (CE) or kernel copy methods. It reports the current measured bandwidth…
11d · Hardware · #coding #gpu · by Eva Sitaridi
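As a rough companion, a DIY host-to-device probe in PyTorch that produces the kind of number NVbandwidth reports; the real tool measures many more patterns (CE vs. kernel copies, device-to-device, latency) far more rigorously:

```python
# Rough host-to-device bandwidth probe -- a stand-in illustration, not the
# NVbandwidth tool itself.
import torch

size_bytes = 1 << 30                                  # 1 GiB pinned buffer
host = torch.empty(size_bytes, dtype=torch.uint8, pin_memory=True)
dev = torch.empty(size_bytes, dtype=torch.uint8, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
dev.copy_(host, non_blocking=True)                    # H2D transfer
end.record()
torch.cuda.synchronize()

gb_per_s = size_bytes / 1e9 / (start.elapsed_time(end) / 1e3)
print(f"H2D: {gb_per_s:.1f} GB/s")
```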
11d ago
NVIDIA Ising Introduces AI-Powered Workflows to Build Fault-Tolerant Quantum Systems
NVIDIA Ising is the world’s first family of open AI models for building quantum processors, launching with two model domains: Ising Calibration and Ising Decoding. Both target the fundamental challenge in quantum computing—qubits are inherently noisy. The best quantum processors make an error roughly once in every thousand operations. To become useful accelerators for scientific and enterprise problems, error rates must drop to one in a trillion or better. AI is the most promising path to closing that gap at scale. Calibration is the process of understanding the noise in each quantum processor and tuning it to achieve the best possible performance. Calibration minimizes error, but because of noise in quantum systems, errors must be corrected in real time by a classical computer, faster than they accumulate. This process is called quantum error correction decoding. Both calibration and decoding are…
11d · Hardware · #agents #coding #gpu · by Tom Lubowe
13d ago
MiniMax M2.7 Advances Scalable Agentic Workflows on NVIDIA Platforms for Complex AI Applications
The release of MiniMax M2.7 adds enhancements to the popular MiniMax M2.5 model, built for agentic harnesses and other complex use cases in fields such as reasoning, ML research workflows, software engineering, and office work. The open weights release of MiniMax M2.7 is now available through NVIDIA and across the open source inference ecosystem. The MiniMax M2 series is a sparse mixture-of-experts (MoE) model family designed for efficiency and capability. The MoE design keeps inference costs low while preserving the full capacity of a 230B-parameter model. It uses multi-head causal self-attention enhanced with Rotary Position Embeddings (RoPE) and Query-Key Root Mean Square Normalization (QK RMSNorm) for stable training at scale. A top-k expert routing mechanism ensures that only the most relevant experts activate for any given input, keeping inference costs low despite the model’s large total parameter count. The result…
13d · Agents · #agents #gpu · by Anu Srivastava
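A minimal sketch of the top-k expert routing the article describes: a gating layer selects k experts per token so only a fraction of the MoE's parameters run. Shapes and k here are illustrative, not MiniMax M2.7's actual configuration:

```python
# Illustrative top-k MoE routing: each token activates only k of N experts.
import torch
import torch.nn.functional as F

tokens = torch.randn(8, 1024)                  # [num_tokens, hidden]
router = torch.nn.Linear(1024, 64)             # gate over 64 experts
logits = router(tokens)

weights, expert_ids = torch.topk(F.softmax(logits, dim=-1), k=2, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen k

print(expert_ids[0], weights[0])  # each token runs just 2 of 64 experts
```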
16d ago
Running Large-Scale GPU Workloads on Kubernetes with Slurm
Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations running large-scale AI training have years of investment in Slurm job scripts, fair-share policies, and accounting workflows. The challenge is getting Slurm scheduling capabilities onto Kubernetes—the standard platform for managing GPU infrastructure at scale—without maintaining two separate environments. Slinky, an open source project developed by SchedMD (now part of NVIDIA), takes two approaches to this integration:

- slurm-bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as a Kubernetes scheduler for pods
- slurm-operator runs full Slurm clusters on Kubernetes infrastructure, managing the complete lifecycle of Slurm daemons as pods

This post focuses on the slurm-operator, which is how NVIDIA runs Slurm on Kubernetes for large-scale GPU training clusters. It walks through…
16d · Hardware · #open-source · by Anton Polyakov
16d ago
Cut Checkpoint Costs with About 30 Lines of Python and NVIDIA nvCOMP
Training LLMs requires periodic checkpoints. These full snapshots of model weights, optimizer states, and gradients are saved to storage so training can resume after interruptions. At scale, these checkpoints become massive (782 GB for a 70B model) and frequent (every 15-30 minutes), generating one of the largest line items in a training budget. Most AI teams chase GPU utilization, training throughput, and model quality. Almost none look at what checkpointing is costing them. This is an expensive oversight. The synchronous checkpoint overhead of a 405B model on 128 NVIDIA Blackwell GPUs alone can cost $200,000 a month. By introducing a lossless compression step implemented with about 30 lines of Python, we can reduce storage costs by $56,000 every month. Mixture of experts (MoE) models save even more. We’ll break down how we got to that calculation and how NVIDIA nvCOMP…
16d · Model · #rag #training #gpu · by Wenqi Glantz
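The shape of that ~30-line change, with stdlib zlib standing in for nvCOMP (which performs the same lossless step on the GPU at far higher throughput); treat this as a sketch, not the post's exact code:

```python
# Sketch of a lossless compression wrapper around checkpoint save/load.
# zlib is a CPU stand-in here; the post uses GPU-side nvCOMP for throughput.
import io
import zlib
import torch

def save_compressed(state_dict: dict, path: str) -> None:
    buf = io.BytesIO()
    torch.save(state_dict, buf)                          # serialize checkpoint
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue(), level=3))  # lossless compress

def load_compressed(path: str) -> dict:
    with open(path, "rb") as f:
        return torch.load(io.BytesIO(zlib.decompress(f.read())))

save_compressed({"w": torch.randn(1024, 1024)}, "ckpt.pt.z")
print(load_compressed("ckpt.pt.z")["w"].shape)
```

The storage saving scales with the compression ratio; weights compress modestly, while optimizer states often compress much better.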
16d ago
How to Accelerate Protein Structure Prediction at Proteome-Scale
Proteins rarely function in isolation as individual monomers. Most biological processes are governed by proteins interacting with other proteins, forming protein complexes whose structures are described in the hierarchy of protein structure as the quaternary representation. This represents one level of complexity up from tertiary representations, the 3D structures of monomers, which have become widely known since the emergence of AlphaFold2 and the creation of the Protein Data Bank. Structural information for the vast majority of complexes remains unavailable. While the AlphaFold Protein Structure Database (AFDB), jointly developed by Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI), transformed access to monomeric protein structures, interaction-aware structural biology at the proteome scale has remained a bottleneck with unique challenges:

- Massive combinatorial interaction space
- High computational cost for multiple sequence alignment (MSA) generation and protein folding
- Inference scaling across millions of…
16d · Tutorial · by Christian Dallago
17d ago
Integrate Physical AI Capabilities into Existing Apps with NVIDIA Omniverse Libraries
Physical AI—AI systems that perceive, reason, and act in physically grounded simulated environments—is changing how teams design and validate robots and industrial systems, long before anything ships to the factory floor. At GTC 2026, NVIDIA highlighted physical AI as a key direction for robotics and digital twins, where policies are trained and validated against physically grounded environments. To make NVIDIA Omniverse easier to integrate into existing applications, NVIDIA is adding a modular, library‑based architecture alongside the existing platform. Core Omniverse components—RTX rendering, PhysX‑based simulation, and data storage pipelines—are being exposed as standalone, headless‑first C APIs with C++ and Python bindings: ovrtx, ovphysx, and ovstorage. For developers with established stacks, these libraries reduce the need for major architectural rewrites and let you integrate Omniverse capabilities without adopting the full Omniverse container stack. Delivering value through modular simulation In…
17d · #rag #coding #gpu · by Ashley Goldstein
18d ago
Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling
The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18 tightly coupled compute trays, massive GPU fabrics, and high-bandwidth networking packaged as a unit. For AI architects and HPC platform operators, the challenge isn’t just racking and stacking hardware—it’s turning infrastructure into safe, performant, and easy-to-use resources for end users. The mismatch between rack-scale hardware topology and scheduler abstractions is where most of the operational complexity lives. Left unaddressed, schedulers operate on a flat pool of GPUs and nodes, overlooking the system’s hierarchical and topology-sensitive design. This is the gap that a validated software stack, such as NVIDIA Mission Control, is designed to bridge. Mission Control provides rack-scale control planes for NVIDIA Grace Blackwell NVL72 systems. With a native understanding of NVIDIA NVLink and NVIDIA IMEX domains, it integrates with…
18d · Hardware · #gpu · by Ryan Prout
23d ago
Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU scheduling. In the previous post, Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6, this was described as the data-to-tensor gap—a performance mismatch between AI pipeline stages. The SMPTE VC-6 (ST 2117-1) codec addresses this gap through a hierarchical, tile-based architecture. Images are encoded as progressively refinable Levels of Quality (LoQs), each adding incremental detail. This enables selective retrieval and decoding of only the required resolution, region of interest, or color plane, with random access to independently decodable frames. Pipelines can retrieve and decode only what the model needs. However, efficient single-image execution does not automatically translate to efficient scaling. As batch sizes grow, the bottleneck shifts from single-image kernel efficiency to workload orchestration, launch cadence, and GPU occupancy.…
23d · Hardware · #inference #multimodal #gpu · by Andreas Kieslinger
23d ago
Bringing AI Closer to the Edge and On-Device with Gemma 4
The Gemmaverse expands with the launch of the latest Gemma 4 multimodal and multilingual models, designed to scale across the full spectrum of deployments, from NVIDIA Blackwell in the data center to Jetson at the edge. These models are suited to meet the growing demand for local deployment for AI development and prototyping, secure on-prem requirements, cost efficiency, and latency-sensitive use cases. The newest generation improves both efficiency and accuracy, making these general-purpose models well-suited for a wide range of common tasks:

- Reasoning: Strong performance on complex problem-solving tasks.
- Coding: Code generation and debugging for developer workflows.
- Agents: Native support for structured tool use (function calling).
- Vision, video, and audio capability: Enables rich multimodal interactions for use cases such as object recognition, automated speech recognition (ASR), document and video intelligence, and more.
- Interleaved multimodal input:…
23d · Infra · #multimodal #local · by Anu Srivastava
23d ago
Achieving Single-Digit Microsecond Latency Inference for Capital Markets
In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use specialized hardware like FPGAs and ASICs. Yet, as markets grow more efficient, traders increasingly depend on advanced models such as deep neural networks to enhance profitability. Because implementing these complex models on low-level hardware requires significant investment, general-purpose GPUs offer a practical, cost-effective alternative. The NVIDIA GH200 Grace Hopper Superchip in the Supermicro ARS-111GL-NHR server has achieved single-digit microsecond latencies in the STAC-ML Markets (Inference) benchmark, Tacana suite (audited by STAC), providing performance comparable to or better than specialized hardware systems. This post details these record-breaking results and provides a deep dive into the custom-tailored solutions required for low-latency GPU inference. It also walks you through an open source reference implementation and a tutorial for getting started. STAC-ML…
23d · Infra · #inference · by Nikolay Markovskiy
24d ago
CUDA Tile Programming Now Available for BASIC!
Note: CUDA Tile Programming in BASIC is an April Fools’ joke, but it’s also real and actually works, demonstrating the flexibility of CUDA. CUDA 13.1 introduced CUDA Tile, a next generation tile-based GPU programming paradigm designed to make fine-grained parallelism more accessible and flexible. One of its key strengths is language openness: any programming language can target CUDA Tile, enabling developers to bring tile-based GPU acceleration into a wide range of ecosystems. In response to overwhelming demand from seasoned developers everywhere, we’re releasing cuTile BASIC for GPUs, bringing CUDA Tile programming to this long-overlooked language. What is cuTile BASIC? cuTile BASIC is an expression of the CUDA Tile programming model in BASIC, built on top of the CUDA Tile IR specification. It enables you to write tile kernels in BASIC using a tile-based model, which is a natural fit for…
24d · Hardware · #coding #gpu · by Rob Armstrong
24d ago
NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design
Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue. MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a wide range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, nine times as many as all other submitters combined. This round, the NVIDIA partner ecosystem participated broadly, with 14 partners—the largest number of partners submitting on any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud,…
24d · Hardware · #inference #gpu · by Ashraf Eassa
24d ago
Accelerate Token Production in AI Factories Using Unified Services and Real-Time AI
In today’s AI factory environment, performance is not theoretical. It is economic, competitive, and existential. A 1% drop in usable GPU time can mean millions of tokens lost per hour. Minutes of congestion can cascade into hours of recovery. A rack-level power oversubscription can lead to stranded power and reduced tokens per watt, silently eroding factory output at scale. As AI factories scale to thousands of GPUs running diverse mission critical workloads, the cost of unpredictable congestion, power constraints, long-tail latency, and limited visibility grows exponentially. Operations teams and administrators need more than dashboards. They need flexibility and foresight. NVIDIA launched NVIDIA Mission Control as an integrated software stack for AI factories built on NVIDIA reference architectures, codifying NVIDIA best practices with a unified control plane. Mission Control version 3.0 expands further, introducing architectural flexibility, multi-org isolation, intelligent power orchestration…
24d · Hardware · by Pradyumna Desale
25d ago
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Spatial computing is moving from visualization to active collaboration, adding increasingly more GPU demands on XR hardware to render photorealistic, physics-accurate, high-fidelity spatial content in real time. Meanwhile, developers have had to maintain separate codebases for every platform, each with different toolchains, SDKs, and streaming protocols. At NVIDIA GTC 2026, NVIDIA CloudXR 6.0 introduced a universal OpenXR-based streaming runtime that works across headsets, operating systems, and browsers—including native visionOS integration. This post walks through how the CloudXR 6.0 architecture works and how to start building today. CloudXR 6.0: Universal OpenXR streaming The release focuses on expanding the reach of NVIDIA RTX-powered content to any spatial display without the constraints of local hardware or manual device provisioning. Native spatial streaming for Apple platforms NVIDIA and Apple have collaborated to build a high-performance bridge for Apple Vision Pro using privacy-protected foveated streaming…
25d · Hardware · #gpu · by Max Bickley
25d ago
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Delivering high-fidelity VR and AR experiences to enterprise users has typically required native application development, custom device management, and complex deployment pipelines. Now, with the new JavaScript SDK NVIDIA CloudXR.js, developers can stream GPU-rendered immersive content directly to a standard web browser—no app store, no installs, no device-specific builds. NVIDIA CloudXR.js brings the full power of NVIDIA RTX remote rendering to the web platform. This is a fundamental shift in how immersive applications are built and delivered. NVIDIA CloudXR.js expands access to enterprise XR beyond native development workflows and into the broad web developer community. Developers building digital twins in NVIDIA Omniverse, robot teleoperation systems, or interactive 3D training environments can now reach users on XR headsets through a URL. This post walks through the SDK architecture, its core API, and how to connect it to server applications such as…
25d · Tutorial · #agents #coding #training #gpu · by Yanzi Zhu
31d ago
Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads
In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition (ASR) or text-to-speech (TTS) models may require only 10 GB of VRAM, yet occupy an entire GPU in standard Kubernetes deployments. Because the scheduler maps a model to one or more GPUs and can’t easily share GPUs across models, expensive compute resources often remain underutilized. Solving this isn’t just about cost reduction—it’s about optimizing cluster density to serve more concurrent users on the same world-class hardware. This guide details how to implement and benchmark GPU partitioning strategies, specifically NVIDIA Multi-Instance GPU (MIG) and time-slicing, to fully use compute resources. Using a production-grade voice AI pipeline as our testbed, we show how to combine models to maximize infrastructure ROI while maintaining >99% reliability and strict latency guarantees. Addressing GPU resource fragmentation By…
31d · Hardware · #inference · by Sagar Desai
31d ago
How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy
In the current state of automotive radar, machine learning engineers can’t work with camera-equivalent raw RGB images. Instead, they work with the output of radar constant false alarm rate (CFAR), which is similar to computer vision (CV) edge detections. The communications and compute architectures haven’t kept pace with trends in AI and the needs of Level 4 autonomy, despite radar being a staple of vehicle‑level sensing for years. The real 3D/4D “image” signal is instead processed inside the edge device. The radar outputs objects, or in some cases point clouds, which is similar to a camera outputting a classical CV Canny edge‑detection image. Centralized radar processing on NVIDIA DRIVE changes this model: Raw analog‑to‑digital converter (ADC) data moves into a centralized compute platform. From there, a software-defined pipeline accelerated by dedicated NVIDIA Programmable Vision Accelerator (PVA) hardware handles everything from…
31d · Hardware · #gpu · by Lachlan Dowling
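For readers unfamiliar with CFAR, here is an illustrative cell-averaging (CA-CFAR) detector on a 1D range profile, the classic algorithm whose thresholded output the post likens to a Canny edge image; guard/train cell counts and the scale factor are illustrative:

```python
# Illustrative CA-CFAR on a 1D power profile: an adaptive threshold built
# from the local noise estimate around each cell under test.
import numpy as np

def ca_cfar(power: np.ndarray, guard: int = 2, train: int = 8, scale: float = 4.0):
    detections = []
    for i in range(train + guard, len(power) - train - guard):
        lead = power[i - guard - train : i - guard]          # training cells before
        lag = power[i + guard + 1 : i + guard + train + 1]   # training cells after
        noise = np.mean(np.concatenate([lead, lag]))         # local noise estimate
        if power[i] > scale * noise:                         # adaptive threshold
            detections.append(i)
    return detections

profile = np.abs(np.random.randn(256)) ** 2
profile[100] = 50.0                       # inject one strong target
print(ca_cfar(profile))                   # -> cells near index 100
```

Centralized processing on raw ADC data lets ML models see the full pre-CFAR signal instead of only detections like these.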
31d ago
Designing Protein Binders Using the Generative Model Proteina-Complexa
Developing new protein-based therapies and catalysts involves the challenging task of designing protein binders, or proteins that bind to a target protein or small molecule. The search space for possible amino acid sequence permutations and resulting 3D protein structures for a designed binder is vast, and achieving strong, specific binding requires careful optimization of the interactions between the protein binder and the target. To address these challenges, NVIDIA has released Proteina-Complexa, a generative model that designs de novo protein binders and enzymes. In this post, we detail the key technologies behind Proteina-Complexa, explore primary use cases, and highlight the extensive experimental validation of generated protein binders. We also provide a step-by-step guide for using the command-line interface to generate your own binders. Key technologies in Proteina-Complexa Proteina-Complexa performance relies on three distinct technical components: the base generative model, the training…
31d · Tutorial · #training #gpu · by Kyle Gion
31d ago
Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per Watt
In the AI era, power is the ultimate constraint, and every AI factory operates within a hard limit. This makes performance per watt—the rate at which power is converted into revenue-generating intelligence—the defining metric for modern AI infrastructure. AI data centers now operate as token factories tied directly to the energy ecosystem, where access to land, power, and shell determines deployment, and efficiency determines output. Increasing revenue within a fixed power envelope depends entirely on maximizing intelligence per watt across AI infrastructure and across the five-layer AI cake ecosystem. This post walks through how NVIDIA architectures, systems, and AI factory software maximize performance per watt at every layer of the stack, and how those efficiency gains translate into higher token throughput and revenue per megawatt. Compounding performance per watt across NVIDIA GPU architectures NVIDIA architectures and platforms are engineered to…
31d · Infra · by Kibibi Moseley
32d ago
Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety
Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale, developers need models that can understand real-world multimodal data, converse naturally with users globally, and operate safely across languages and modalities. At GTC 2026, NVIDIA introduced a new generation of NVIDIA Nemotron models designed to work together as a unified agentic stack:

- NVIDIA Nemotron 3 Super for long-context reasoning and agentic tasks
- NVIDIA Nemotron 3 Ultra (coming soon) for highest reasoning accuracy and efficiency among open frontier models
- NVIDIA Nemotron 3 Content Safety for multimodal, multilingual content moderation
- NVIDIA Nemotron 3 VoiceChat (in early access) for low latency, natural, full-duplex voice interactions
- NVIDIA Nemotron 3 Nano Omni (coming soon) for enterprise-grade multimodal understanding
- NVIDIA Nemotron RAG for generating embeddings for image and…
32d · Infra · #rag #agents #multimodal #gpu · by Chintan Patel
33d ago
NVIDIA IGX Thor Powers Industrial, Medical, and Robotics Edge AI Applications
Industrial and medical systems are rapidly increasing the use of high-performance AI to improve worker productivity, human-machine interaction, and downtime management. From factory automation cells to autonomous mobile platforms to surgical rooms, operators are deploying increasingly complex generative AI models, more sensors, and higher‑fidelity data streams at the edge. Safety and regulatory compliance are meanwhile crucial, making deterministic behavior, high availability, and verifiable functional safety essential design requirements. This post introduces NVIDIA IGX Thor, a platform built for the demands of powering industrial AI at the edge, including a deep dive into performance and safety features. What is NVIDIA IGX Thor? NVIDIA IGX Thor is an enterprise-ready platform for physical AI. It offers server‑class AI performance together with industrial-grade hardware, advanced functional safety capabilities, extended lifecycle support, and an enterprise software stack in configurations suitable for industrial and medical…
33d · Hardware · #agents #gpu #safety · by Suhas Hariharapura Sheshadri
33d ago
Building a Zero-Trust Architecture for Confidential AI Factories
AI is moving from experimentation to production. However, most of the data enterprises need exists outside the public cloud. This includes sensitive information like patient records, market research, and legacy systems containing enterprise knowledge. There’s also a risk to using private data with AI models, and adoption is often slowed or blocked by privacy and trust concerns. Next-generation AI factories—high-performance infrastructure specialized to manufacture intelligence at scale—must be built on a zero-trust foundation. This security architecture eliminates implicit trust in the underlying host infrastructure by using hardware-enforced Trusted Execution Environments (TEEs) and cryptographic attestation. This post describes the full-stack architecture needed to integrate the zero-trust foundation into AI factories. On-premise requirements often limit enterprises to building their own models or using open source models for agentic AI workloads. To deliver on the promise of AI, organizations must deploy a…
33d · Research · by Hema Bontha
33d ago
Deploying Disaggregated LLM Inference Workloads on Kubernetes
As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages have fundamentally different compute profiles, yet traditional deployments force them onto the same hardware, leaving GPUs underutilized and scaling inflexible. Disaggregated serving addresses this by splitting the inference pipeline into distinct stages such as prefill, decode, and routing, each running as an independent service that can be resourced and scaled on its own terms. This post will give an overview of how disaggregated inference gets deployed on Kubernetes, explore different ecosystem solutions and how they execute on a cluster, and evaluate what they provide out of the box. How do aggregated and disaggregated inference differ? Before diving into Kubernetes manifests, it helps to understand the two inference deployment modes for LLMs: In aggregated serving, a single…
33d · Infra · #inference #coding · by Anish Maddipoti
38d ago
How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain
While consumer AI offers powerful capabilities, workplace tools often suffer from disjointed data and limited context. Built with LangChain, the NVIDIA AI-Q blueprint is an open source template that bridges this gap. LangChain recently introduced an enterprise agent platform built with NVIDIA AI to support scalable, production-ready agent development. This tutorial, available as an NVIDIA launchable, shows developers how to use the AI-Q blueprint to create a deep research agent that tops leaderboards and can be connected to enterprise systems. The blueprint uses the best of open and frontier LLMs, is optimized using the NVIDIA NeMo Agent Toolkit, and monitored with LangSmith. The result: faster time-to-production for agentic search apps that keep business data exactly where it belongs—private and in a secure environment. The NVIDIA AI-Q blueprint and NeMo Agent Toolkit are both part of the broader NVIDIA Agent Toolkit,…
38d · Tutorial · #langchain #gpu · by Sean Lopp
39d ago
Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere
AI-native services are exposing a new bottleneck in AI infrastructure: As millions of users, agents, and devices demand access to intelligence, the challenge is shifting from peak training throughput to delivering deterministic inference at scale—predictable latency, jitter, and sustainable token economics. NVIDIA announced at GTC 2026 that telcos and distributed cloud providers are transforming their networks into AI grids, embedding accelerated computing across a mesh of regional POPs, central offices, metro hubs, and edge locations to meet the needs of AI-native services. This post explains how AI grids make real-time, multi-modal, and hyper-personalized AI experiences viable at scale by running inference across distributed, workload-, resource- and KPI-aware AI infrastructure. Intelligent workload placement across distributed sites The NVIDIA AI Grid reference design provides a unified framework for building geographically distributed, interconnected, and orchestrated AI infrastructure. Figure 1 shows how existing network…
39d · Infra · #gpu · by Sree Sankar
40d ago
Using Simulation to Build Robotic Systems for Hospital Automation
Healthcare faces a structural demand–capacity crisis: a projected global shortfall of ~10 million clinicians by 2030, billions of diagnostic exams annually with significant unmet demand, hundreds of millions of procedures with large access gaps, and costly operating room (OR) inefficiencies measured in tens of dollars per minute. The future hospital must therefore be automation-enabled—where robotics extends clinician capacity, increases procedural throughput, reduces variability, and democratizes access to high-quality care. Imagine autonomous imaging robots navigating patient anatomy to provide X-rays for the unserved billions, while in the OR, ‘Surgical Subtask Automation’ handles repetitive suturing so surgeons can focus on critical decisions. Beyond the bedside, service robots recapture wasted minutes by autonomously delivering supplies, saving nurses miles of walking. The data gap and real-world limits The core bottleneck is data. Hospitals are heterogeneous, chaotic, and high-stakes environments—every facility has different layouts, workflows,…
40d · Infra · #agents #inference · by Mingxin Zheng
40d ago
Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI
AI‑native organizations increasingly face scaling challenges as agentic AI workflows drive context windows to millions of tokens and models scale toward trillions of parameters. These systems rely on agentic long‑term memory for context that persists across turns, tools, and sessions so agents can build on prior reasoning instead of starting from scratch on every request. As context windows increase, Key-Value (KV) cache capacity requirements grow proportionally, while the compute requirements to recalculate that history grow much faster, making KV cache reuse and efficient storage essential for performance and efficiency. This increases pressure on existing memory hierarchies, forcing AI providers to choose between scarce GPU high‑bandwidth memory (HBM) and general‑purpose storage tiers optimized for durability, data management, and protection—not for serving ephemeral, AI-native KV cache—driving up power consumption, inflating cost per token, and leaving expensive GPUs underutilized. The NVIDIA Vera Rubin…
40d · Infra · #rag #agents #gpu · by Moshe Anschel
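The proportional growth is easy to see with the standard KV cache sizing formula: 2 (for K and V) × layers × KV heads × head dimension × tokens × bytes per element. The model shape below is an illustrative assumption, not from the post:

```python
# Illustrative KV cache sizing: grows linearly with context length.
# The Llama-3-70B-like shape here is an assumption for demonstration.
def kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                tokens=1_000_000, dtype_bytes=2) -> float:
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes / 1e9

print(f"{kv_cache_gb():.0f} GB per million-token context")  # ~328 GB
```

At hundreds of gigabytes per long-context session, keeping it all in HBM is exactly the squeeze a dedicated context-memory tier targets.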
40d ago
How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale
Reasoning models are growing rapidly in size and are increasingly being integrated into agentic AI workflows that interact with other models and external tools. Deploying these models and workflows in production environments requires distributing them across multiple GPU nodes, which demands careful orchestration and coordination across GPUs. NVIDIA Dynamo 1.0—available now—addresses these problems by accelerating generative AI and reasoning models in large-scale distributed environments. The AI framework delivers low-latency, high-throughput, distributed inference for production-grade multi-node AI deployments. Dynamo supports leading open source inference engines, including SGLang, NVIDIA TensorRT LLM, and vLLM. It also has delivered strong results in trusted third-party benchmarks such as MLPerf and SemiAnalysis InferenceX, reinforcing its position as a production-grade inference platform. Dynamo can boost the number of requests served by up to 7x on NVIDIA Blackwell, as demonstrated in the recent SemiAnalysis InferenceX benchmark. SemiAnalysis InferenceX,…
40d · Agents · #agents #inference #gpu · by Amr Elmeleegy
40d ago
Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark
Autonomous AI agents are driving the next wave of AI innovation. These agents must often manage long-running tasks that use multiple communication channels and background subprocesses simultaneously to explore options, test solutions, and generate optimal results. This places extreme demands on local compute. NVIDIA DGX Spark provides the performance necessary for autonomous agents to execute these complex workflows efficiently and locally. Now NVIDIA NemoClaw, part of the NVIDIA Agent Toolkit, installs the NVIDIA OpenShell runtime—a secure environment for running autonomous agents—along with open source models like NVIDIA Nemotron. This post discusses several important aspects of system capabilities and performance that are necessary to power always-on autonomous agents and explains why NVIDIA DGX Spark is an ideal desktop platform for autonomous AI. Inference for autonomous AI agents Agentic tools often need to process massive context windows. OpenClaw, for example,…
40d · Infra · #agents #gpu · by Allen Bourgoyne
40d ago
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Building AI factories is complex and requires efficient integration across compute, networking, security, and storage systems. To achieve rapid Time to AI and strong ROI, the new NVIDIA DSX Air is enabling organizations to simulate their entire AI factory infrastructure in the cloud—covering compute, networking, storage, and security. Being able to design, test, and optimize systems before deploying hardware enables every layer of the AI factory to function as a unified, optimized system, preventing major delays or performance issues related to integration or misconfiguration challenges. DSX Air also enables continuous testing and validation of provisioning, automation, and security policies to streamline ongoing operations. This post shows how users can benefit from NVIDIA DSX Air through accelerated deployment timelines and simplified, full-stack cluster management. How DSX Air enables AI factory simulation To make AI factory simulation useful and practical for end…
40d · Infra · #rag #gpu · by Ranga Maddipudi
40d ago
NVIDIA Vera CPU Delivers High Performance, Bandwidth, and Efficiency for AI Factories
AI is evolving, and reasoning models are increasing token demand, placing new requirements on every layer of AI infrastructure. More than ever, compute must scale efficiently to maximize token production and improve productivity for model creators and users. Modern GPUs operate at peak capacity, pushing throughput higher every generation, but system performance is increasingly gated by the CPU-bound serial tasks within an agentic loop, a classic example of a core computer science principle called Amdahl’s law. This dynamic is especially visible in two classes of workloads: reinforcement learning (RL) for training models with new specialized skills such as coding or engineering, and agentic actions, which enable AI agents to use tools like web browsers, databases, code interpreters, and other software to complete tasks in real environments, or sandboxes. Both workloads combine two historically separate CPU characteristics. Individual environments require strong single-threaded…
40d · Infra · #gpu · by Praveen Menon
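A worked Amdahl's law example for that agentic loop, with illustrative numbers: if a fraction p of the loop is accelerated by a factor s, the CPU-bound serial remainder caps the overall speedup:

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / s), p = accelerated fraction.
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

print(f"{amdahl_speedup(0.90, 10):.2f}x")    # 5.26x, not 10x
print(f"{amdahl_speedup(0.90, 1000):.2f}x")  # 9.91x: the serial 10% caps gains
```

That cap is why faster CPU-side serial work (stronger single-threaded performance) matters even when the GPUs are already saturated.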
40d ago
Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell
AI has evolved from assistants following your directions to agents that act independently. Called claws, these agents can take a goal, figure out how to achieve it, and execute indefinitely—while leaving you out of the loop. The more capable claws become, the harder they are to trust. And their self-evolving autonomy changes everything about the environment in which they operate. The infrastructure to run claws more safely didn’t exist, until now. NVIDIA at GTC announced NemoClaw, an open source stack that simplifies running OpenClaw always-on assistants—with a single command. It incorporates policy-based privacy and security guardrails, giving you control over your agents’ behavior and data handling. This enables self-evolving claws to run more safely in the cloud, on prem, on NVIDIA RTX PCs, and on NVIDIA DGX Spark. NVIDIA NemoClaw uses open source models—like NVIDIA Nemotron—alongside the NVIDIA OpenShell runtime,…
40d · Tutorial · #agents #gpu · by Ali Golshan
40d ago
Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform
NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of agentic systems. Co-designed with the NVIDIA Vera Rubin NVL72, LPX equips the AI factory with an engine optimized for fast, predictable token generation, while Vera Rubin NVL72 remains the flexible, general-purpose workhorse for training and inference, delivering high throughput across prefill and decode, including long-context processing, decode attention, and high-concurrency serving at scale. This combination matters because the agentic future demands a new category of inference. As generation speeds approach 1,000 tokens per second per user, models move beyond conversation-speed interaction toward speed of thought computing. At that rate, AI systems can reason, simulate, and respond continuously, enabling experiences that feel less like turn-based chat and more like real-time collaboration. This shift also raises the ceiling…
40d · Hardware · #inference #gpu · by Kyle Aubrey
40d ago
NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer
Artificial intelligence is token-driven. Every prompt, reasoning step, and agent interaction generates tokens. Over the past year, token consumption has grown multifold and now exceeds 10 quadrillion tokens per year. And while the majority of tokens have been generated from humans interacting with AI, the new era is one in which most tokens will be generated from AI interacting with AI. Modern agentic systems plan tasks, invoke tools, execute code, retrieve data, and coordinate across continuous multistep workflows with numerous AI agents. These interactions generate large volumes of reasoning tokens, expand KV cache, and require CPU-based sandboxed environments to test and validate results generated by accelerated computing systems. This places low latency, high throughput demands across GPUs, CPUs, scale-up domains, scale-out networks, and storage. Delivering useful intelligence for these modern agentic systems requires fleets of purpose-built rack-scale systems that function…
40d · Infra · #agents #gpu · by Rohil Bhargava
40d ago
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
Physics forms the foundation of robotic simulation, enabling realistic modeling of motion and interaction. For tasks like locomotion and manipulation, simulators must handle complex dynamics such as contact forces and deformable objects. While most engines trade off speed for realism, Newton—a GPU-accelerated, open source simulator—is designed to do both. Newton 1.0 GA, announced at NVIDIA GTC 2026, is delivering an accelerated, production-ready foundation for dexterous manipulation and locomotion tasks. Built as an extensible physics engine on NVIDIA Warp and OpenUSD, Newton lets robots learn how to handle complex tasks with greater precision, speed, and extensibility while using frameworks such as NVIDIA Isaac Lab and NVIDIA Isaac Sim. Newton is a modular framework that brings together multiple solvers and simulation components behind a unified architecture. Rather than being tied to a single scene format, it supports a broad runtime data model that…
40d · Tutorial · #open-source #gpu · by Philipp Reist
43d ago
Scale Synthetic Data and Physical AI Reasoning with NVIDIA Cosmos World Foundation Models
The next generation of AI-driven robots like humanoids and autonomous vehicles depends on high-fidelity, physics-aware training data. Without diverse and representative datasets, these systems don’t get proper training and face testing risks due to poor generalization, limited exposure to real-world variations, and unpredictable behavior in edge cases. Collecting massive real-world datasets for training is expensive, time-intensive, and often constrained by what can practically be captured. NVIDIA Cosmos addresses this challenge by accelerating world foundation model (WFM) development. At the core of its platform, Cosmos WFMs speed up synthetic data generation and act as a foundation for post-training, to develop downstream domain or task-specific physical AI models to solve these challenges. This post explores the latest Cosmos WFMs, their key capabilities that advance physical AI, and how to use them. Cosmos world foundation model updates: NVIDIA Cosmos world foundation models have continued to evolve rapidly,…
43d · Tutorial · #agents #training #gpu · by Pranjali Joshi
44d ago
Build Accelerated, Differentiable Computational Physics Code for AI with NVIDIA Warp
Computer-aided engineering (CAE) is shifting from human-driven workflows toward AI-driven ones, including physics foundation models that generalize across geometries and operating conditions. Unlike LLMs, these models depend on large volumes of high-fidelity, physics-compliant data. Recent scaling-law work on computational fluid dynamics (CFD) surrogates indicates that simulation-generated training data is often the limiting cost in practice. This pushes requirements onto the simulator, which must be GPU-native, fast, and able to plug directly into ML workflows. NVIDIA Warp is a framework for accelerated simulation, data generation, and spatial computing that bridges CUDA and Python. Warp enables developers to write high-performance kernels as regular Python functions that are JIT-compiled into efficient code for execution on the GPU. Unlike tensor-based frameworks, in which developers express computation as operations on entire N-dimensional arrays, developers author flexible kernels in the Warp framework that execute simultaneously…
44d · Infra · #agents #coding #gpu · by Sheel Nidhan
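A minimal Warp kernel in the style the post describes: a regular Python function JIT-compiled for the GPU and launched over a 1D grid, rather than whole-array tensor operations (SAXPY here is a generic example, not from the post):

```python
# Minimal NVIDIA Warp kernel: per-element SAXPY launched over a 1D grid.
import warp as wp

@wp.kernel
def saxpy(a: float, x: wp.array(dtype=float), y: wp.array(dtype=float)):
    i = wp.tid()               # one thread per element
    y[i] = a * x[i] + y[i]

x = wp.full(1024, 1.0, dtype=float)
y = wp.zeros(1024, dtype=float)
wp.launch(saxpy, dim=1024, inputs=[2.0, x, y])
print(y.numpy()[:4])           # -> [2. 2. 2. 2.]
```

The per-thread indexing style, as opposed to whole-array ops, is what lets simulation kernels express irregular, element-wise physics logic directly.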
44d ago
Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes
Every AI cluster running on Kubernetes requires a full software stack that works together, from low-level driver and kernel settings to high-level operator and workload configurations. You get one cluster working, and spend days getting the next one to match. Upgrade a component, and something else breaks. Move to a new cloud and start over. AI Cluster Runtime is a new open-source project designed to remove cluster configuration from the critical path. It publishes optimized, validated, and reproducible Kubernetes configurations as recipes you can deploy onto your clusters. How AI Cluster Runtime works To support GPU clusters across cloud and on-premises AI factories, NVIDIA validates specific combinations of drivers, runtimes, operators, kernel modules, and system settings for AI workloads. AI Cluster Runtime publishes those results as recipes. These version-locked YAML files capture which components were tested, the versions, and the…
44dHardwareby Mark Chmarny
44d ago
Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics
Physical AI is rapidly evolving, from next-generation software-defined autonomous vehicles (AVs) to humanoid robots. The challenge is no longer how to run a large language model (LLM), but how to enable high-fidelity reasoning, real-time multimodal interaction, and trajectory planning within strict power and latency envelopes. NVIDIA TensorRT Edge-LLM, a high-performance C++ inference runtime for LLMs and vision language models (VLMs) on embedded platforms, is designed to overcome these challenges. As explained in this post, the latest TensorRT Edge-LLM release delivers a significant expansion in fundamental capabilities for NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor platforms. It introduces advanced edge architectures, including mixture of experts (MoE), the NVIDIA Cosmos Reason 2 open planning model for physical AI, and Qwen3-TTS and Qwen-ASR models for embedded speech processing. Building on these foundational pillars, the release also offers optimized support for the NVIDIA…
44dTutorial#agentsby Lin Chai
45d ago
Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning
Agentic AI systems need models with the specialized depth to solve dense technical problems autonomously. They must excel at reasoning, coding, and long-context analysis, while remaining efficient enough to run continuously at scale. Multi-agent systems generate up to 15x the tokens of standard chats, re-sending history, tool outputs, and reasoning steps at every turn. Over long tasks, this “context explosion” causes goal drift, where agents gradually lose alignment with the original objective. And using massive reasoning models for every sub-task—the “thinking tax”—makes multi-agent applications too expensive and sluggish for practical use. Today, we are releasing Nemotron 3 Super to address these limitations. The new Super model is a 120B total, 12B active-parameter model that delivers maximum compute efficiency and accuracy for complex multi-agent applications such as software development and cybersecurity triaging. This model follows the introduction of Nemotron 3 Nano…
45dAgents#agents#codingby Chris Alexiuk
46d ago
NVIDIA RTX Innovations Are Powering the Next Era of Game Development
NVIDIA RTX ray tracing and AI-powered neural rendering technologies are redefining how games are made, enabling a new standard for visuals and performance. At GDC 2026, NVIDIA unveiled the latest path tracing innovations elevating visual fidelity, on-device AI models enabling players to interact with their favorite experiences in new ways, and enterprise solutions accelerating game development from the ground up. This post provides a detailed overview of these latest innovations, including: - Introducing a new system for dense, path-traced foliage in NVIDIA RTX Mega Geometry - Adding path-traced indirect lighting with ReSTIR PT in the NVIDIA RTX Dynamic Illumination SDK and RTX Hair (beta) for strand-based acceleration in the NVIDIA branch of UE5 - Expanding language recognition support in NVIDIA ACE; production-quality on-device text-to-speech (TTS); a small language model (SLM) with advanced agent capabilities for AI-powered game characters - Enabling…
46dAgents#agents#observability#local#gpuby Ike Nnoli
46d ago
Reliable AI Coding for Unreal Engine: Improving Accuracy and Reducing Token Costs
Agentic code assistants are moving into daily game development as studios build larger worlds, ship more DLCs, and support distributed teams. These assistants can accelerate development by helping with tasks like generating gameplay scaffolding, refactoring repetitive systems, and answering engine-specific questions faster. This post outlines how developers can build reliable AI coding workflows for Unreal Engine (UE) 5, from individual setups to team and enterprise-scale systems. Reliability is critical because real-world Unreal codebases are defined by engine conventions, large C++ projects, custom tools, branch differences, and studio-specific coding patterns that generic AI often fails to understand. The core challenge is the context gap. Failures rarely come from weak code generation, but from missing constraints such as code patterns, branch differences, or internal conventions. Improving context retrieval reduces guesswork and makes AI output reliable enough for production use. NVIDIA works with…
46dInfra#agents#codingby Paul Logan
47d ago
CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features
CUDA 13.2 arrives with a major update: NVIDIA CUDA Tile is now supported on devices of compute capability 8.X architectures (NVIDIA Ampere and NVIDIA Ada), as well as 10.X, 11.X and 12.X architectures (NVIDIA Blackwell). In an upcoming release of the CUDA Toolkit, all GPU architectures starting with Ampere will be fully supported. If you’re using Ampere, Ada, or Blackwell GPU architectures, check out the cuTile Python Quickstart guide to get started with CUDA Tile. This post explores the CUDA 13.2 release, which boosts developer productivity with a variety of new Python additions, including profiling in CUDA Python and debugging Numba kernels. The math libraries provide expanded support for high-performance emulated libraries, and CUDA Core Compute Libraries (CCCL) continue to add both performance and feature improvements, providing C++ developers with a high-performance, modern interface to GPU programming. cuTile Python cuTile…
47dHardware#local#gpuby Jonathan Bentz
47d ago
Implementing Falcon-H1 Hybrid Architecture in NVIDIA Megatron Core
In the rapidly evolving landscape of large language model (LLM) development, NVIDIA Megatron Core has emerged as the foundational framework for training massive transformer models at scale. The open source library offers industry-leading parallelism and GPU-optimized performance. Now developed GitHub-first in the NVIDIA/Megatron-LM repo, Megatron Core is increasingly shaped by contributions from foundation model builders, making it a more flexible, future-proofed engine for open AI models. This post provides a technical overview of how the Technology Innovation Institute (TII), creators of the Falcon model family, have contributed to and integrated with Megatron Core and Megatron Bridge frameworks. The first section examines the implementation of the Falcon-H1 parallel hybrid architecture within Megatron Bridge, highlighting the challenges of coordinating heterogeneous Transformer and Mamba layers alongside non-learnable µP multipliers. The second section explores the integration of BitNet into Megatron Core, detailing the replacement…
47dModel#training#gpuby Mireille Fares
47d ago
Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library
Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving, KV cache loading, and wide expert parallelism. In disaggregated serving environments, prefill and decode phases are run on separate GPUs, requiring efficient KV cache transfers between them. Low-latency, high-throughput communication to move these KV caches is critical to realizing the benefits of disaggregated serving. In KV cache loading, storage is used to help with growing KV caches in multiturn and agentic AI workloads such as coding assistants and reasoning. For long-context KV, previously computed results can be loaded from local SSDs and remote storage instead of being recomputed during prefill. This is one example that explains why storage…
47dHardware#inference#gpuby Seonghee Lee
47d ago
Removing the Guesswork from Disaggregated Serving
Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal configuration for any given workload (such as hardware, parallelism, and prefill/decode split) resides in a massive, multi-dimensional search space that is impossible to explore manually or through exhaustive testing. AIConfigurator, an open source tool that simplifies the NVIDIA Dynamo AI serving stack, is intended to cut through this complexity and get you to an optimal deployment in minutes. The core benefit of AIConfigurator is that you don’t need to run every possible configuration on real hardware to predict which one will perform best. Instead, it decomposes LLM inference into its constituent operations and measures each one in isolation on the target GPU. AIConfigurator can then reassemble those measurements to estimate the end-to-end performance of any configuration, all without occupying a single…
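The measure-once, compose-many idea at the heart of the tool can be sketched in a few lines. This is a toy illustration of the approach, not the AIConfigurator API; the op names and costs are invented:

```python
import time

# Toy version of the decompose-and-measure idea: profile each primitive op
# in isolation once, then estimate any configuration's latency by summing
# the costs of the ops it uses (no end-to-end runs needed).
def profile(op, reps=10):
    t0 = time.perf_counter()
    for _ in range(reps):
        op()
    return (time.perf_counter() - t0) / reps

# Stand-in workloads; on real hardware these would be isolated GPU kernels.
op_cost = {
    "attention": profile(lambda: sum(i * i for i in range(20_000))),
    "mlp":       profile(lambda: sum(i * i for i in range(50_000))),
}

def estimate(config):
    """config maps op name -> invocations per token for a candidate setup."""
    return sum(op_cost[op] * n for op, n in config.items())

print(f"estimated per-token latency: {estimate({'attention': 32, 'mlp': 32}):.4f}s")
```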
47dInfra#inferenceby Tianhao Xu
51d ago
Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile
In this post, we dive into one of the most critical workloads in modern AI: Flash Attention, where you’ll learn: - How to implement Flash Attention using NVIDIA cuTile. Walk through the complete code for a production-ready implementation. - The “trap and rescue” optimization journey. This case study shows how naive optimizations (like just increasing tile size) can backfire, and how to fix them. - Advanced techniques like FMA patterns, fast math, loop splitting, and adaptive tiling for maximum performance. Environment requirements: - CUDA 13.1 or higher - GPU architecture: Compute capability 8.X, 10.X, 11.X, 12.X (NVIDIA Ampere, NVIDIA Ada, NVIDIA Blackwell) - Python: 3.10 or higher See the quickstart doc for more information on installing cuTile Python. What is attention? The attention mechanism is the computational heart of transformer models. Given a sequence of tokens, attention enables each token…
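The core trick behind Flash Attention can be sketched outside cuTile: an online softmax that consumes K/V one tile at a time while carrying a running max and normalizer, so the full attention matrix never materializes. A minimal NumPy sketch for a single query vector (illustrative only, not the post's cuTile kernel):

```python
import numpy as np

def flash_attention_1q(q, K, V, tile=128):
    """Online-softmax attention for one query vector q against K, V."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = -np.inf                      # running max of scores seen so far
    l = 0.0                          # running softmax normalizer
    acc = np.zeros(V.shape[-1])      # running weighted sum of values
    for j in range(0, K.shape[0], tile):
        s = (K[j:j+tile] @ q) * scale        # scores for this tile
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)       # rescale the previous state
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[j:j+tile]
        m = m_new
    return acc / l

# Check against the naive reference that materializes all scores
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 32))
s_full = (K @ q) / np.sqrt(64)
w = np.exp(s_full - s_full.max()); w /= w.sum()
assert np.allclose(flash_attention_1q(q, K, V), w @ V)
```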
51dTutorial#gpuby Alessandro Morari
51d ago
Controlling Floating-Point Determinism in NVIDIA CCCL
A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a simple property to guarantee, it can be difficult to achieve in practice, especially in parallel programming and floating-point arithmetic. This is because floating-point addition and multiplication aren’t strictly associative—that is, (a + b) + c may not equal a + (b + c)—due to rounding that occurs when intermediate results are stored with finite precision. With NVIDIA CUDA Core Compute Libraries (CCCL) 3.1, CUB—a low-level CUDA library for speed-of-light parallel device algorithms—added a new single-phase API that accepts an execution environment, enabling users to customize algorithm behavior. We can use this environment to configure the reduce algorithm’s determinism property. This can only be done through the new single-phase API, since the two-phase API doesn’t accept an…
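The non-associativity is easy to demonstrate on the CPU, and it is exactly what a parallel reduction exposes when thread completion order varies between runs. A minimal sketch:

```python
import random

# Floating-point addition is not associative: reassociating a sum changes
# rounding, and therefore the bitwise result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0  (b + c rounds back to -1e16)

# A parallel reduction combines partial sums in whatever order threads
# finish, so the same inputs can yield different low-order bits per run.
rng = random.Random(0)
vals = [rng.uniform(-1, 1) for _ in range(100_000)]
s1 = sum(vals)
s2 = sum(sorted(vals))          # same numbers, different order
print(s1 == s2, s1 - s2)        # typically False, with a tiny difference
```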
51dHardware#coding#gpuby Nader Al Awar
53d ago
How to Minimize Game Runtime Inference Costs with Coding Agents
NVIDIA ACE is a suite of technologies for building AI agents for gaming. ACE provides ready-to-integrate cloud and on-device AI models for every part of in-game characters, from speech to intelligence to animation. To run these models alongside the game engine efficiently, the NVIDIA In-Game Inferencing (NVIGI) SDK includes a set of performant libraries that developers can integrate into C++ games and applications. NVIDIA In-Game Inferencing SDK 1.5 introduces a new code agent sample in which an AI agent works with the player to defeat monsters in a 2D dungeon. AI agents driven by local small language models (SLMs) can make excessive calls to the GPU that compete with graphics. This post examines how to minimize the number of inference calls and maximize what each call accomplishes, reducing contention on the GPU between graphics and compute. Code agents: Trapping the…
53dTutorial#inference#coding#local#gpuby Brandon Rowlett
53d ago
cuTile.jl Brings NVIDIA CUDA Tile-Based Programming to Julia
NVIDIA CUDA Tile is one of the most significant additions to NVIDIA CUDA programming and unlocks automatic access to tensor cores and other specialized hardware. Earlier this year, NVIDIA released cuTile for Python, giving Python developers a natural way to write high-performance GPU kernels. Now, the same programming model is available in Julia through cuTile.jl. In this blog post, we’ll explore how cuTile.jl simplifies the development of high-performance CUDA kernels, demonstrate its idiomatic Julia syntax, and discuss its performance parity with the existing cuTile Python implementation. What is tile-based GPU programming? Traditional GPU programming with CUDA requires developers to think about threads, warps, and memory hierarchies. While powerful, this approach requires the programmer to map algorithms onto hardware efficiently. With CUDA Tile, developers describe operations on tiles of data, and the compiler handles the mapping to hardware. Consider vector addition.…
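The mental-model shift can be illustrated in plain Python (the post's examples are Julia; this conceptual sketch is not cuTile.jl code): thread-style programming reasons about one element per lane, tile-style programming describes the operation on a whole block and lets the compiler map it to hardware.

```python
import numpy as np

n = 1 << 16
a, b = np.random.rand(n), np.random.rand(n)

# Thread-style mental model: one scalar lane per element.
out_scalar = np.empty(n)
for i in range(n):                 # what each CUDA thread conceptually does
    out_scalar[i] = a[i] + b[i]

# Tile-style mental model: express the op on a whole tile at once and let
# the compiler/runtime decide how it maps to warps and tensor cores.
TILE = 4096
out_tile = np.empty(n)
for start in range(0, n, TILE):
    out_tile[start:start+TILE] = a[start:start+TILE] + b[start:start+TILE]

assert np.allclose(out_scalar, out_tile)
```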
53dHardware#coding#gpuby Tim Besard
55d ago
Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo
Autonomous networks are quickly becoming one of the top priorities in telecommunications. According to the latest NVIDIA State of AI in Telecommunications report, 65% of operators said AI is driving network automation, and 50% named autonomous networks as the top AI use case for ROI. Yet many telcos still report gaps in AI and data science expertise. This makes it difficult to scale safe, closed-loop automation across complex, multidomain networks. Most telecom network operations centers (NOCs) today operate using reactive, alarm-driven workflows. Engineers manually triage thousands of incidents across multiple tools, sift through a high volume of alarm and performance data, and stitch together fragmented dashboards and logs before applying a fix or dispatching a field team. NOCs are a natural starting point for autonomous networks, because they concentrate high-volume, repeatable tasks where AI can directly cut MTTR and OPEX.…
55d#rag#agents#gpuby Aiden Chang
55d ago
5 New Digital Twin Products Developers Can Use to Build 6G Networks
To make 6G a reality, the telecom industry must overcome a fundamental challenge: how to design, train, and validate AI-native networks that are too complex to be tested in the physical world. The NVIDIA Aerial Omniverse Digital Twin (AODT) solves this by enabling a continuous integration/continuous delivery (CI/CD)-style workflow where Radio Access Network (RAN) software is trained, simulated, and validated in a physics-accurate environment before field deployment. As discussed in a recent post, this approach bridges the gap between statistical models and real-world network performance. But the usability of any technology is as important as the technology itself. That’s why NVIDIA designed AODT not just as a powerful simulation platform, but with a modular and accessible architecture that partners and developers can easily integrate into their own workflows. Within two years of its launch, AODT’s modular architecture is growing an…
55dTutorial#codingby Cindy Goh
57d ago
Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints
Alibaba has introduced the new open source Qwen3.5 series built for native multimodal agents. The first model in this series is a ~400B-parameter native vision-language model (VLM) with reasoning, built on a hybrid architecture of mixture of experts (MoE) and Gated Delta Networks. Qwen3.5 can understand and navigate user interfaces, which improves on the previous generation of VLMs. Qwen3.5 is ideal for a variety of use cases, including: - Coding, including web development - Visual reasoning, including mobile and web interfaces - Chat applications - Complex search Build with NVIDIA endpoints You can start building with Qwen3.5 today with free access to GPU-accelerated endpoints on build.nvidia.com, powered by NVIDIA Blackwell GPUs. As part of the NVIDIA Developer Program, you can explore quickly in the browser, experiment with prompts, and even test the model with your own data to evaluate…
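The build.nvidia.com endpoints speak the OpenAI-compatible API, so calling the model takes a few lines. Note the model identifier below is a placeholder since the excerpt doesn't give the exact catalog id; check build.nvidia.com for the real one:

```python
from openai import OpenAI  # the NVIDIA endpoints accept OpenAI-style requests

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",        # issued via the NVIDIA Developer Program
)
resp = client.chat.completions.create(
    model="qwen/qwen3.5-vl",              # placeholder id; look up the catalog entry
    messages=[{"role": "user", "content": "Describe this UI and the next action."}],
)
print(resp.choices[0].message.content)
```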
57dHardware#qwen#fine-tuning#multimodal#open-sourceby Anu Srivastava
57d ago
Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM
Organizations deploying LLMs are challenged by inference workloads with different resource requirements. A small embedding model might use only a few gigabytes of GPU memory, while a 70B+ parameter LLM could require multiple GPUs. This diversity often leads to low average GPU utilization, high compute costs, and unpredictable latency. The problem isn’t just about packing more workloads onto GPUs but about scheduling them intelligently. Without orchestration that understands inference workload patterns, organizations face a choice between overprovisioning (wasting resources) and underprovisioning (degrading performance). This blog post covers: - The inference utilization problem: Why traditional scheduling underutilizes GPU resources. - How NVIDIA NIM delivers production inference: The role of containerized microservices in standardizing model deployment. - NVIDIA Run:ai’s intelligent scheduling strategies: Four key capabilities that enhance performance (lower latency, increase TPS/GPU) while increasing GPU utilization and reducing compute costs. - Benchmarking…
57dHardware#inference#embeddings#gpuby Shwetha Krishnamurthy
59d ago
Making Softmax More Efficient with NVIDIA Blackwell Ultra
LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI “speed of thought” is increasingly governed not by the massive throughput of matrix multiplications, but by the transcendental math of the softmax function. Transcendentals refer to functions that cannot be expressed as the root of a polynomial equation with rational coefficients. Consequently, they “transcend” basic algebraic operations like addition and multiplication—the exact operations Tensor Cores excel at. In the specific context of softmax, the most computationally expensive of these transcendentals is the natural exponential function that is executed on Special Function Units (SFUs). In NVIDIA assembly instructions (SASS), this function is invoked via the MUFU.EX2 instruction. This architectural split creates a softmax bottleneck within the attention block, when powerful matrix engines are forced…
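The MUFU.EX2 detail is worth making concrete: the SFU evaluates 2^x, so e^x is computed via the identity e^x = 2^(x · log2 e). A small NumPy sketch of a softmax written the way the hardware actually evaluates it (conceptual, not the kernel from the post):

```python
import numpy as np

LOG2E = np.float32(1.4426950408889634)   # log2(e)

def softmax_ex2(x):
    # The SFU evaluates 2**y (MUFU.EX2), so e**x becomes 2**(x * log2(e)).
    x = x - x.max()                      # standard max-subtraction for stability
    p = np.exp2(x * LOG2E)               # np.exp2 stands in for MUFU.EX2
    return p / p.sum()

x = np.random.randn(8).astype(np.float32)
ref = np.exp(x - x.max()); ref /= ref.sum()
assert np.allclose(softmax_ex2(x), ref, atol=1e-6)
```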
59dHardware#gpuby Jamie Li
61d ago
Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy
As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as training throughput expectations, memory limits, and rising costs are becoming the primary barriers to scaling transformer models. Using lower-precision training can address these challenges. By reducing the numeric precision used during computation, GPUs can process more operations per cycle, enhancing training efficiency and lowering costs. This post compares the following three low-precision training formats directly against established BF16 precision training across multi-hundred-billion token pretraining runs and downstream benchmarks: - 8-bit floating point per-tensor current scaling (FP8-CS) - Microscaling FP8 block-scaled training (MXFP8) - NVFP4 precision training using NVIDIA NeMo Megatron Bridge, an open source library that is part of the NVIDIA NeMo framework We present practical, large-scale results showing how low-precision training delivers up to…
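To build intuition for what block-scaled 4-bit training means numerically, here is a simplified NumPy simulation of FP4 (E2M1) quantization with one shared scale per 16-element block. Real NVFP4 stores a narrow floating-point scale per block and pairs it with a carefully engineered training recipe; this toy uses a float32 scale for clarity:

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_like(x, block=16):
    """Toy block-scaled FP4 round-trip (float32 scales for simplicity)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                      # avoid divide-by-zero blocks
    normalized = np.abs(x) / scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(normalized[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(x) * FP4_GRID[idx] * scale).reshape(-1)

x = np.random.randn(64).astype(np.float32)
xq = quantize_nvfp4_like(x)
print("max abs round-trip error:", np.abs(x - xq).max())
```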
61dInfra#inference#trainingby Aditya Vavre
65d ago
Accelerating Data Processing with NVIDIA Multi-Instance GPU and Locality Domains
NVIDIA flagship data center GPUs in the NVIDIA Ampere, NVIDIA Hopper, and NVIDIA Blackwell families all feature non-uniform memory access (NUMA) behaviors, but expose a single memory space. Most programs therefore do not have an issue with memory non-uniformity. However, as bandwidth increases in newer generation GPUs, there are significant performance and power gains to be had when taking into consideration compute and data locality. This post first analyzes the memory hierarchy of the NVIDIA GPUs, discussing the power and performance impacts of data transfer over die-to-die link. It then reviews how to use NVIDIA Multi-Instance GPU (MIG) mode to achieve data localization. Finally, it presents results for running MIG mode versus unlocalized for the Wilson-Dslash stencil operator use case. Note: The techniques described in this post are exploratory, and the field is evolving quickly. New developments may supersede what…
65dHardware#gpuby Mukul Joshi
66d ago
Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai
As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is wholly delivered by NVIDIA Run:ai in any environment—cloud, NCP, and on-premises. This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to evaluate how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius’ AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and hyperscaler-grade performance and elasticity needed to deliver these gains at production scale. All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments. The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs: - 77% of full…
66dHardware#inference#gpuby Boskey Savla
66d ago
Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute
Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and to maintain bindings back to Python. For most Python developers and researchers, this is a significant barrier to entry. Frameworks like PyTorch address this by implementing kernels in CUDA C++—either handwritten or by leveraging libraries like the NVIDIA CUDA Core Compute Libraries. Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often better, since its primitives are highly optimized per architecture and are rigorously tested. But exposing CUB to Python traditionally means building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators—limiting flexibility on the Python side. The NVIDIA cuda.compute library overcomes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives.…
66dResearch#coding#benchmark#gpuby Daniel Rodriguez
66d ago
How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models
As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control. Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve its country’s diverse population, support nearly two dozen languages, and keep model development and data governance fully under India’s sovereign control. To meet strict latency targets and improve inference efficiency for its flagship Sovereign 30B model, Sarvam AI collaborated with…
66dHardware#inference#coding#gpuby Utkarsh Uppal
67d ago
Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities
Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms, and embedded metadata. Financial reports carry critical insights in tables, engineering manuals rely on diagrams, and legal documents often include annotated or scanned content. Retrieval-augmented generation (RAG) was created to ground LLMs in trusted enterprise knowledge—retrieving relevant source data at query time to reduce hallucinations and improve accuracy. But if a RAG system processes only surrounding text, it misses key signals embedded in tables, charts, and diagrams—resulting in incomplete or incorrect answers. An intelligent agent is only as good as the data foundation it’s built on. Modern RAG must therefore be inherently multimodal—able to understand both visual and textual context to achieve enterprise-grade accuracy. The NVIDIA Enterprise RAG Blueprint is built for this, providing a modular reference architecture that connects…
67dInfra#rag#multimodalby Shruthii Sathyanarayanan
74d ago
R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab
Building robust, intelligent robots requires testing them in complex environments. However, gathering data in the physical world is expensive, slow, and often dangerous. It is nearly impossible to safely train for real-world critical risks, such as high-speed collisions or hardware failures. Worse, real-world data is usually biased toward “normal” conditions, leaving robots unprepared for the unexpected. Simulation is essential to bridge this gap, providing a risk-free environment for rigorous development. However, traditional pipelines struggle to support the complex needs of modern robotics. Today’s generalist robots must master multimodal learning—fusing diverse inputs such as vision, touch, and proprioception to navigate messy, unstructured worlds. This creates a new requirement for simulation: it must deliver scale, realism, and multimodal sensing all in one tight training loop, something traditional CPU-bound simulators cannot handle efficiently. This edition of NVIDIA Robotics Research and Development Digest (R²D²)…
74dInfra#multimodal#gpuby Oyindamola Omotuyi
74d ago
Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities
Scientists and engineers who design and build unique scientific research facilities face similar challenges. These include managing massive data rates that exceed the capacity of current computational infrastructure to extract scientific insights, and steering experiments in real time. These challenges are obstacles to maximizing the impact of scientific discoveries and significantly slow the pace of knowledge growth. Scientists and engineers at NVIDIA work with these facilities to develop new solutions built on parallel and distributed computation that remove these blockers. This post will walk through two notable examples of formalizing complex physics problems into tractable mathematical puzzles that benefit greatly from GPU-accelerated scientific computing, involving the U.S. Department of Energy: NSF-DOE Vera C. Rubin Observatory and SLAC’s Linac Coherent Light Source II (LCLS-II). These unique and massive-scale research facilities both took a decade to build and enable unprecedented scientific discoveries to…
74dResearchby Quynh L. Nguyen
75d ago
Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy
NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires significant manual effort. To address this challenge, today we are announcing the availability of AutoDeploy as a beta feature in TensorRT LLM. AutoDeploy compiles off-the-shelf PyTorch models into inference-optimized graphs. This avoids the need to bake inference-specific optimizations directly into model code, reducing LLM deployment time. AutoDeploy enables the shift from manually reimplementing and optimizing each model toward a compiler-driven workflow that separates model authoring from inference optimization. This post introduces AutoDeploy architecture and capabilities and shows how it enabled support for recent NVIDIA Nemotron models at launch. What is AutoDeploy? Every new LLM architecture comes with its own inference challenges, from transformer models to hybrid vision language models (VLMs) to state space models (SSMs). Turning a reference…
75dInfra#agents#inference#multimodal#codingby Lucas Liebenwein
78d ago
3 Ways NVFP4 Accelerates AI Training and Inference
The latest AI models continue to grow in size and complexity, demanding increasing amounts of compute performance for training and inference—far beyond what Moore’s Law can keep up with. That’s why NVIDIA engages in extreme codesign. Designing across multiple chips and a mountain of software cohesively enables large generational leaps in AI factory performance and efficiency. Lower-precision AI formats are key to improving compute performance and energy efficiency. Bringing the benefits of ultra-low-precision numerics to AI training and inference while maintaining high accuracy requires extensive engineering across every layer of the technology stack. It spans the creation of the formats, implementation in silicon, enablement across many libraries, and working closely with the ecosystem to deploy new training recipes and inference optimization techniques. NVFP4, developed and implemented for NVIDIA GPUs starting with NVIDIA Blackwell, delivers the performance and energy-efficiency benefits of…
78dInfra#inference#trainingby Ashraf Eassa
79d ago
How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation
Specialized AI models are built to perform specific tasks or solve particular problems. But if you’ve ever tried to fine-tune or distill a domain-specific model, you’ve probably hit a few blockers, such as: - Not enough high-quality domain data, especially for proprietary or regulated use cases - Unclear licensing rules around synthetic data and distillation - High compute costs when a large model is excessive for targeted tasks - Slow iteration cycles that make it difficult to reach production-level ROI These challenges often prevent promising AI projects from progressing beyond the experimental phase. This post walks you through how to remove all four of these blockers using a production-ready, license-safe synthetic data distillation pipeline. Quick links - Nemotron 3 Nano on OpenRouter - NeMo Data Designer open source library - NeMo Data Designer: Product Information Dataset Generator with Q&A example…
79dTutorial#fine-tuningby Alex Steiner
79d ago
How Painkiller RTX Uses Generative AI to Modernize Game Assets at Scale
Painkiller RTX sets a new standard for how small teams can balance massive visual ambition with limited resources by integrating generative AI. By upscaling thousands of legacy textures into high-quality Physically Based Rendering (PBR) materials—a process that would have traditionally taken years—the team dramatically reduced the burden of repetitive work. This approach was especially impactful for contributors without traditional modding backgrounds, freeing them to focus on creative decisions: refining materials and ensuring the game’s iconic atmosphere responds correctly to ray-traced lighting. Learn how the team architected a production pipeline that blends automation with artistic judgment across 35 unique levels. To explore the motivations, solutions, and lessons behind these technical challenges, we spoke with McGillacutty (environment reconstruction and material lead), Quinn Baddams (team lead and founder of Merry Pencil Studios), and NightRaven (creator of PBRFusion). What’s your professional background and current…
79dInfraby Phillip Singh
80d ago
Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints
Kimi K2.5 is the newest open vision language model (VLM) from the Kimi family of models. Kimi K2.5 is a general-purpose multimodal model that excels in current high-demand tasks such as agentic AI workflows, chat, reasoning, coding, mathematics, and more. The model was trained using the open source Megatron‑LM framework. Megatron-LM provides accelerated computing for scalability and GPU optimization through several types of parallelism (tensor, data, sequence) for training massive transformer-based models. This model architecture builds on leading state-of-the-art large open models for efficiency and capability. The model is composed of 384 experts with a single dense layer, which allows for smaller-sized experts and specialized routing for different modalities. Kimi K2.5 achieves a 3.2% activation rate of parameters per token. For vision capability, the large training vocabulary of 164K contains vision-specific tokens. Kimi created the MoonViT3d Vision Tower for the…
80dTutorial#fine-tuning#multimodal#gpuby Anu Srivastava
80d ago
How to Build a Document Processing Pipeline for RAG with Nemotron
What if your AI agent could instantly parse complex PDFs, extract nested tables, and “see” data within charts as easily as reading a text file? With NVIDIA Nemotron RAG, you can build a high-throughput intelligent document processing pipeline that handles massive document workloads with precision and accuracy. This post walks you through the core components of a multimodal retrieval pipeline step-by-step. First, we show you how to use the open source NVIDIA NeMo Retriever library to decompose complex documents into structured data using GPU-accelerated microservices. Then, we demonstrate how to wire that data into Nemotron RAG models to ensure your assistant provides grounded, accurate answers with full traceability back to the source. Let’s dive in. Quick links to the model and code Access the following resources for the tutorial: 🧠 Models on Hugging Face: - nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding - nvidia/llama-nemotron-rerank-vl-1b-v2…
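Once documents are decomposed into chunks and embedded, the retrieval step itself is a top-k similarity search. Here is a generic NumPy sketch of that step, assuming embeddings have already been produced by an embedding model such as the one named above (the model loading and reranking stages are omitted):

```python
import numpy as np

def top_k(query_emb, chunk_embs, k=4):
    """Cosine-similarity retrieval over pre-computed chunk embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]      # best-scoring chunks first
    return order, scores[order]

# Toy stand-ins for document-chunk embeddings; in the full pipeline a
# reranker model re-scores these k candidates before generation.
chunks = np.random.randn(100, 768)
idx, scores = top_k(np.random.randn(768), chunks)
print(idx, scores)
```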
80dTutorial#rag#agents#gpuby Chia-Chih Chen
81d ago
Accelerating Long-Context Model Training in JAX and XLA
Large language models (LLMs) are rapidly expanding their context windows, with recent models supporting sequences of 128K tokens, 256K tokens, and beyond. However, training these models with extended context lengths presents significant computational and communication challenges. As context lengths grow, the memory and communication overhead of attention mechanisms scale quadratically, creating bottlenecks that traditional parallelism strategies struggle to address efficiently. This post demonstrates that integrating the NVSHMEM communication library into the Accelerated Linear Algebra (XLA) compiler optimizes context parallelism. This integration enables efficient training of the Llama 3 8B model in the JAX framework with sequences up to 256K tokens. Our results show that NVSHMEM provides up to 36% speedup over the NVIDIA Collective Communications Library (NCCL) for long-context training workloads, particularly when combined with tensor parallelism across multiple nodes. The long-context training challenge To understand why NVSHMEM provides significant speedups for long-context…
81dModel#llama#training#gpuby Sevin Fide Varoglu
82d ago
Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel
In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is challenging. EP communication is essentially all-to-all, but due to its dynamic and sparse nature (only the top-k experts per token instead of all experts), it’s challenging to implement and optimize. This post details an efficient MoE EP communication solution, Hybrid-EP, and its use in the NVIDIA Megatron family of frameworks, on NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet platforms. It also dives into the effectiveness of Hybrid-EP in real-world model training. Efficiency challenges of hyperscale MoE model training DeepSeek-V3 is a representative model of the new generation of large-scale fine-grained MoE models. Such models balance computational overhead with model performance through hyperscale parameter counts with sparse activation, but they also pose serious challenges for existing large-model training frameworks. - Communication efficiency bottlenecks: The MoE model relies on parallel experts and…
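Why the all-to-all is dynamic and sparse is easy to see from the routing itself: each token picks its own top-k experts, so per-expert token counts are uneven and change with every batch. A minimal NumPy sketch of the routing step (illustrative, not the Hybrid-EP implementation):

```python
import numpy as np

def route_topk(logits, k=2):
    """Pick top-k experts per token and count tokens per expert.
    The uneven, input-dependent counts are what make EP all-to-all
    communication dynamic and sparse."""
    topk = np.argsort(-logits, axis=1)[:, :k]        # (tokens, k) expert ids
    counts = np.bincount(topk.ravel(), minlength=logits.shape[1])
    return topk, counts

tokens, experts = 1024, 64
logits = np.random.randn(tokens, experts)
topk, counts = route_topk(logits)
print("tokens per expert: min", counts.min(), "max", counts.max())
```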
82dModel#training#gpuby Fan Yu
85d ago
Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton
NVIDIA CUDA Tile is a GPU-based programming model that targets portability for NVIDIA Tensor Cores, unlocking peak GPU performance. One of the great things about CUDA Tile is that you can build your own DSL on top of it. This post shares the work NVIDIA is doing to integrate CUDA Tile as a backend for OpenAI Triton, an open source Python DSL designed to write DL kernels for GPUs. OpenAI Triton supports tiled computation, a technique that divides data and computational tasks into small blocks. Triton contains an MLIR-based compiler that generates PTX. This enables researchers without CUDA experience to write efficient GPU code. What are CUDA Tile and CUDA Tile IR? CUDA Tile extends the CUDA programming model to enable first-class support for tile programming. Introduced in CUDA 13.1, CUDA Tile represents a paradigm shift in GPU programming. Rather…
85dHardware#coding#gpuby Jie Xin
85d ago
Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor
Sparse tensors are vectors, matrices, and higher-dimensional generalizations with many zeros. They are crucial in various fields such as scientific computing, signal processing, and deep learning due to their efficiency in storage, computation, and power. Despite their benefits, handling sparse tensors manually or through existing libraries is often cumbersome, error-prone, nonportable, and does not scale with the combinatorial explosion of sparsity patterns, data types, operations, and targets. Research largely focuses on sparse storage formats—data structures that compactly store nonzeros and allow efficient operations that avoid redundancies such as x+0=x and x*0=0. This enables scaling to larger sizes or solving the same sizes with fewer resources. No single sparse format is optimal; the best choice depends on the nonzero distribution, operations, and target architecture. The Universal Sparse Tensor (UST) decouples a tensor’s sparsity from its memory storage representation. The UST uses a…
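The format-versus-content distinction the UST generalizes is the same one SciPy users already know: the same logical tensor can live in COO (easy to build) or CSR (efficient for row-wise operations like SpMV). A small SciPy sketch of that decoupling:

```python
import numpy as np
from scipy.sparse import coo_matrix

# The same nonzeros in two formats: COO is easy to construct incrementally,
# CSR is efficient for row-oriented operations such as SpMV.
rows = np.array([0, 0, 2, 3])
cols = np.array([1, 3, 2, 0])
vals = np.array([10.0, 20.0, 30.0, 40.0])
A_coo = coo_matrix((vals, (rows, cols)), shape=(4, 4))
A_csr = A_coo.tocsr()          # format conversion; same logical tensor

x = np.ones(4)
print(A_csr @ x)               # [30. 0. 30. 40.]
```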
85dResearch#rag#embeddingsby Aart J.C. Bik
85d ago
Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk
AI coding agents enable developers to work faster by streamlining tasks and driving automated, test-driven development. However, they also introduce a significant, often overlooked, attack surface by running tools from the command line with the same permissions and entitlements as the user, making them computer use agents, with all the risks those entail. The primary threat to these tools is that of indirect prompt injection, where a portion of the content ingested by the LLM driving the agent is provided by an adversary through vectors such as malicious repositories or pull requests, git histories with prompt injections, .cursorrules or CLAUDE/AGENT.md files that contain prompt injections, or malicious MCP responses. Such malicious instructions to the LLM can result in it taking attacker-influenced actions with adverse consequences. Manual approval of actions performed by the agent is the most common way to manage…
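One simple layer of the defense-in-depth the post discusses is gating tool execution on an allowlist before falling back to manual approval. A toy Python sketch of that gate (the policy, helper names, and allowlist below are hypothetical examples, not the post's recommendations):

```python
import shlex
import subprocess

# Hypothetical policy: a few read-only commands run automatically; anything
# else requires explicit human approval (never auto-run attacker-influenced
# input in a fully privileged shell).
ALLOWLIST = {"ls", "cat", "git"}
READONLY_GIT = {"status", "log", "diff"}

def run_agent_command(cmd: str) -> str:
    argv = shlex.split(cmd)
    if not argv or argv[0] not in ALLOWLIST:
        raise PermissionError(f"blocked: {cmd!r} needs manual approval")
    if argv[0] == "git" and (len(argv) < 2 or argv[1] not in READONLY_GIT):
        raise PermissionError(f"blocked: {cmd!r} needs manual approval")
    out = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return out.stdout

print(run_agent_command("git status"))
```

A real sandbox would also confine the filesystem and network, since even "read-only" commands can exfiltrate data.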
85dAgents#agents#codingby Rich Harang
87d ago
Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare
NVIDIA Run:ai v2.24 introduces time-based fairshare, a new scheduling mode that brings fair-share scheduling with time awareness for over-quota resources to Kubernetes clusters. This capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing challenge in shared GPU infrastructure. Consider two teams with equal priority sharing a cluster. Team A continuously submits smaller jobs, while Team B needs to run a larger job that requires more resources. Every time resources free up, the smaller jobs from Team A fit immediately and get scheduled. The larger job from Team B continues to wait for enough resources to become available. Before that happens, the next small job from Team A claims the freed capacity. The result: although both teams have identical priority and entitlements, Team A runs job after job while the job from Team B sits…
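The intuition behind time-based fairshare can be shown with a toy scheduler that charges each team for the GPU-seconds it has actually consumed and always admits the most underserved team next. This is a conceptual sketch of the idea, not KAI Scheduler's algorithm:

```python
# Toy time-based fairshare: track each team's accumulated over-quota
# GPU-seconds and admit the next job from the team that has consumed
# the least so far, so a stream of small jobs cannot starve a big one.
usage = {"team_a": 0.0, "team_b": 0.0}
queue = [("team_a", 10), ("team_b", 240), ("team_a", 10), ("team_a", 10)]

schedule = []
pending = list(queue)
while pending:
    team = min(usage, key=usage.get)                  # most underserved team
    job = next((j for j in pending if j[0] == team), None)
    if job is None:                                   # that team has no work
        team = min({t for t, _ in pending}, key=lambda t: usage[t])
        job = next(j for j in pending if j[0] == team)
    pending.remove(job)
    schedule.append(job)
    usage[team] += job[1]                             # charge GPU-seconds used

print(schedule)  # team_b's large job runs second instead of last
```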
87dHardware#gpuby Ekin Karabulut
87d ago
Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core
This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training. It dynamically selects the CP size per microbatch to efficiently handle variable-length sequences, achieving up to 1.48x speedup on real-world datasets. In large-scale model training, an often-overlooked bottleneck arises from the sequence-length variability in real-world datasets. Both LLM training and large-scale video generation have clear long-tail distributions in sequence length. A small fraction of ultra-long samples accounts for a disproportionately large share of the computational workload and memory consumption. In LLM training, this leads to wide-ranging text sequence lengths across batches. In video generation, high-resolution, multi-second videos can span tens of thousands of tokens. This results in imbalanced sample-level FLOPs and memory usage across data-parallel ranks, modalities, and micro-batches, hindering efficient scheduling and resource utilization. To manage variable-length inputs,…
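The core decision Dynamic-CP makes per microbatch can be caricatured in a few lines: pick the smallest parallel size that keeps each GPU's share of the sequence under a memory/compute budget, so short sequences avoid the communication cost of a large, fixed CP size. A toy policy sketch (the real Megatron Core heuristic is more involved):

```python
def pick_cp_size(seq_len, capacity=8192, max_cp=8):
    """Toy Dynamic-CP policy: smallest power-of-two context-parallel size
    such that each GPU's slice of the sequence fits a per-GPU token budget."""
    cp = 1
    while seq_len / cp > capacity and cp < max_cp:
        cp *= 2
    return cp

for seq in (2_000, 12_000, 60_000):
    print(seq, "tokens -> CP size", pick_cp_size(seq))
# 2,000 -> 1   12,000 -> 2   60,000 -> 8
```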
87dInfra#multimodal#training#gpuby Kunlun Li
87d ago
Updating Classifier Evasion for Vision Language Models
Advances in AI architectures have unlocked multimodal functionality, enabling transformer models to process multiple forms of data in the same context. For instance, vision language models (VLMs) can generate output from combined image and text input, enabling developers to build systems that interpret graphs, process camera feeds, or operate with traditionally human interfaces like desktop applications. In some situations, this additional vision modality may process external, untrusted images, and there is a significant body of prior work on the attack surface of image-processing machine learning systems. In this post, we’ll apply some of these historical ideas to modern architectures to help developers understand the various threats and mitigations unlocked in the vision domain. Vision language models VLMs extend the transformer architecture popularized by large language models (LLMs) to accept both text and image input. VLMs can be finetuned to caption, detect, and segment objects, and…
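The classic evasion idea from that prior work is gradient-sign perturbation (FGSM): move every pixel a small step in the direction that most changes the model's output. A toy NumPy demonstration on a linear classifier, which captures the principle even though real VLM attacks operate through the vision encoder:

```python
import numpy as np

# FGSM-style evasion on a toy linear classifier: nudge each input feature
# (pixel) by at most eps in the direction that most decreases the score.
rng = np.random.default_rng(0)
w, b = rng.normal(size=64), 0.0            # classifier score = w @ x + b
x = rng.normal(size=64)

eps = 0.1
x_adv = x - eps * np.sign(w)               # gradient of the score w.r.t. x is w
print("clean score:", w @ x + b)
print("adversarial:", w @ x_adv + b)       # drops by eps * ||w||_1
# A large score shift from imperceptibly small per-pixel changes is the
# root of the image-channel prompt-injection and evasion threats.
```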
87dInfra#multimodalby Joseph Lucas
88d ago
Accelerating Diffusion Models with an Open, Plug-and-Play Offering
Recent advances in large-scale diffusion models have revolutionized generative AI across multiple domains, from image synthesis to audio generation, 3D asset creation, molecular design, and beyond. These models have demonstrated unprecedented capabilities in producing high-quality, diverse outputs across various conditional generation tasks. Despite these successes, sampling inefficiency remains a fundamental bottleneck. Standard diffusion models require tens to hundreds of iterative denoising steps, leading to high inference latency and substantial computational cost. This limits practical deployment in interactive applications, edge devices, and large-scale production systems. Video generation faces an especially critical challenge. Open source models such as NVIDIA Cosmos—along with commercial text-to-video (T2V) systems—have shown remarkable text-to-video capabilities. However, video diffusion models are orders of magnitude more computationally demanding due to the temporal dimension. Generating a single video can take minutes to hours, making real-time video generation, interactive editing, and…
89d ago
Adaptive Inference in NVIDIA TensorRT for RTX Enables Automatic Optimization
Deploying AI applications across diverse consumer hardware has traditionally forced a trade-off. You can optimize for specific GPU configurations and achieve peak performance at the cost of portability. Alternatively, you can build generic, portable engines and leave performance on the table. Bridging this gap often requires manual tuning, multiple build targets, or accepting compromises. NVIDIA TensorRT for RTX seeks to eliminate this trade-off. At under 200 MB, this lean inference library provides a Just-In-Time (JIT) optimizer that compiles engines in under 30 seconds. This makes it ideal for real-time, responsive AI applications on consumer-grade devices. TensorRT for RTX introduces adaptive inference—engines that optimize automatically at runtime for your specific system, progressively improving compilation and inference performance as your application runs. No manual tuning, no multiple build targets, no intervention required. Build a lightweight, portable engine once, deploy it anywhere, and…
89dHardware#inference#gpuby George Stefanakis
89d ago
How to Unlock Local Detail in Coarse Climate Projections with NVIDIA Earth-2
Global climate models are good at the big picture—but local climate extremes, like hurricanes and typhoons, often disappear in the details. Those patterns are still there—you just need the right tools to unlock them in high-resolution climate data. Using NVIDIA Earth‑2, this blog post shows you how to downscale coarse climate projections into higher-resolution, bias‑corrected fields—revealing local detail not resolved in the raw data. Why downscaling climate projections is key to risk assessment High-resolution projections play a key role in assessing physical climate risk, informing decisions from infrastructure planning to agricultural adaptation. However, running global models at fine resolution is computationally prohibitive—requiring significant compute, storage, and time. Coupled Model Intercomparison Project Phase 6 (CMIP6) provides the most widely used global climate projections—underpinning IPCC reports and sector-specific risk models—but its outputs are often too coarse to capture short-lived weather events, like…
89dTutorial#local#gpuby Georg Ertl
93d ago
Scaling NVFP4 Inference for FLUX.2 on NVIDIA Blackwell Data Center GPUs
In 2025, NVIDIA partnered with Black Forest Labs (BFL) to optimize the FLUX.1 text-to-image model series, unlocking FP4 image generation performance on NVIDIA Blackwell GeForce RTX 50 Series GPUs. As a natural extension of the latent diffusion model, FLUX.1 Kontext [dev] proved that in-context learning is a feasible technique for visual-generation models, not just large language models (LLMs). To make this experience more widely accessible, NVIDIA collaborated with BFL to enable a near real-time editing experience using low-precision quantization. FLUX.2 is a significant leap forward, offering the public multi-image references and quality comparable to the best enterprise models. However, because FLUX.2 [dev] requires substantial compute resources, BFL, Comfy, and NVIDIA collaborated to achieve a major breakthrough: reducing the FLUX.2 [dev] memory requirement by more than 40% and enabling local deployment through ComfyUI. This optimization, using FP8 precision, has made FLUX.2…
93dHardware#inference#multimodal#gpuby Sandro Cavallari
94d ago
Streamlining CUB with a Single-Call API
The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code. This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance. What is CUB? If you need to run a standard algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a principal component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance. While libraries like Thrust provide a high-level, “host-side” interface similar to the C++…
94dHardwareby Giannis Gonidelis
100d ago
How to Train an AI Agent for Command-Line Tasks with Synthetic Data and Reinforcement Learning
What if your computer-use agent could learn a new Command Line Interface (CLI)—and operate it safely without ever writing files or free-typing shell commands? In Part 1 of our series on building a computer use agent, we built a custom Bash computer-use agent using NVIDIA Nemotron in just one hour. In this sequel, we’ll take it further by teaching the same reasoning model with no prior knowledge to safely operate the LangGraph Platform CLI. This shows how easily a large reasoning model can be specialized to perform new, agentic tasks. Instead of simple file operations, our new agent will learn to start local servers, build containers, and generate Dockerfiles—entirely through a verifiable, human-in-the-loop command interface. We’ll combine synthetic data generation (SDG) and Reinforcement Learning with Verifiable Rewards (RLVR), optimized via Group Relative Policy Optimization (GRPO), to make training both efficient…
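The GRPO piece of the recipe is compact enough to show directly: instead of a learned value function, each sampled rollout is scored against the mean of its own group. A minimal sketch of the group-relative advantage computation (the sampling, reward checker, and policy update are omitted):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: score each sampled response against the
    mean of its own group, normalized by the group's std (GRPO-style,
    no learned value function needed)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, G=6 sampled rollouts scored by a verifiable reward
# (e.g., did the generated CLI command pass the validator?).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
print(grpo_advantages(rewards))   # passing rollouts get positive advantage
```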
100dTutorial#agentsby Chris Alexiuk
101d ago
How to Write High-Performance Matrix Multiply in NVIDIA CUDA Tile
This blog post is part of a series designed to help developers learn NVIDIA CUDA Tile programming for building high-performance GPU kernels, using matrix multiplication as a core example. In this post, you’ll learn: - How to implement high-performance matrix multiplication using NVIDIA cuTile: Understand the flow of Tile loading, computation, and storage. - About the block-level parallel programming mindset: Shift from thread-level thinking to block-level thinking. - Best practices for Tile programming: Learn performance optimization strategies from the code. Before you begin, be sure your environment meets the following requirements (see the quickstart for more information): Environment requirements: - CUDA 13.1 or higher - GPU architecture NVIDIA Blackwell (e.g., NVIDIA RTX 50 series) - Python: 3.10 or higher Install cuTile Python: pip install cuda-tile Note: cuTile is the next-generation GPU programming framework for NVIDIA. While it only supports optimization…
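The block-level mindset the post teaches is easiest to see in a host-side sketch: each output tile is owned by one "block," which loads the matching A and B tiles, accumulates over the K dimension, and stores its result. A NumPy illustration of that flow (conceptual only, not the cuTile kernel from the post):

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Block-level matmul: each (i, j) output tile is owned by one 'block'
    and accumulates partial products over K-dimension tiles, mirroring the
    load -> compute -> store flow of a tile-programming kernel."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):          # one output tile per "block"
            acc = np.zeros((min(T, M - i), min(T, N - j)), dtype=A.dtype)
            for k in range(0, K, T):      # sweep K tiles, accumulating
                acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
            C[i:i+T, j:j+T] = acc
    return C

A, B = np.random.rand(96, 128), np.random.rand(128, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```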
101dTutorial#coding#gpuby Jinman Xie
101d ago
NVIDIA DLSS 4.5 Delivers Super Resolution Upgrades and New Dynamic Multi Frame Generation
NVIDIA DLSS 4 with Multi Frame Generation has become the fastest-adopted NVIDIA gaming technology ever. Over 250 games and apps use it to make real-time path tracing possible—and upcoming titles for 2026, including PRAGMATA and Resident Evil Requiem, also plan to incorporate the software. At CES 2026, the technology became even more powerful. NVIDIA introduced DLSS 4.5 with a second-generation transformer model for super resolution, and a 6x mode for Multi Frame Generation and Dynamic Multi Frame Generation that automatically shifts the frame generation multiplier in real time to maximize smoothness across games and scenes. Today, developers can begin using the second-generation transformer model for DLSS Super Resolution to provide superior image quality. A more powerful DLSS Super Resolution model DLSS 4 introduced a transformer model architecture with NVIDIA GeForce RTX 50 Series GPUs. That enabled a leap in image…
101dHardware#rag#observability#coding#gpuby Ike Nnoli
102d ago
Learn How NVIDIA cuOpt Accelerates Mixed Integer Optimization using Primal Heuristics
NVIDIA cuOpt is a GPU-accelerated optimization engine designed to deliver fast, high-quality solutions for large, complex decision-making problems. At its core, cuOpt uses mixed integer programming (MIP), which formulates decisions as a set of linear constraints over both continuous and integer variables. The types of problems that can be modeled as MIP are numerous, spanning production planning, supply-chain network optimization, transportation, routing and dispatch, workforce and task scheduling, and quantitative finance. Accelerated primal heuristics for MIP solvers are algorithms…
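To make the MIP formulation concrete, here is a tiny production-planning model expressed with SciPy's generic MIP interface. This shows the shape of the problem such solvers consume; it is not the cuOpt API:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint

# Maximize profit 3*x0 + 2*x1 subject to a shared capacity x0 + x1 <= 4,
# with both production counts required to be integers. milp minimizes,
# so we negate the objective; default variable bounds are x >= 0.
c = np.array([-3.0, -2.0])
capacity = LinearConstraint(np.array([[1.0, 1.0]]), ub=4.0)
res = milp(c=c, constraints=[capacity], integrality=np.ones(2))
print(res.x, -res.fun)   # [4. 0.] with profit 12.0
```

Primal heuristics aim to find good feasible points like this one quickly, before (or instead of) proving optimality via branch and bound.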
102dTutorial#gpuby Piotr Sielski
106d ago
Reimagining LLM Memory: Using Context as Training Data Unlocks Models That Learn at Test-Time
We keep seeing LLMs with larger context windows in the news, along with promises that they can hold entire conversation histories, volumes of books, or multiple codebases in view at once. And yet, these models still repeat the same mistakes. We still have to copy and paste the earlier context back into the chat for LLMs to “get it”. A smart co-worker would pick up on these patterns, adapt, and carry the lessons forward. Why can’t LLMs? In this blog post, we observe a critical difference between LLM memory and human memory. Then, we introduce test-time training with an end-to-end formulation (TTT-E2E), our latest research, in which the LLM compresses the context it’s reading into its weights through next-token prediction. Our key results are highlighted in Figure 1, which measures scaling with context length, in terms of loss (left) and…
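The core idea can be miniaturized: treat the context itself as training data and take a few next-token-prediction gradient steps at test time, so the context is absorbed into the weights rather than re-read on every query. A toy character-bigram version (illustrative only; TTT-E2E does this end-to-end inside an LLM):

```python
import numpy as np

# Toy test-time training: compress a context string into the weights of a
# character-bigram softmax model via next-token-prediction SGD steps.
context = "the cat sat on the mat. the cat sat on the mat."
chars = sorted(set(context))
V = len(chars)
ix = {c: i for i, c in enumerate(chars)}
xs = np.array([ix[c] for c in context[:-1]])   # current characters
ys = np.array([ix[c] for c in context[1:]])    # next characters

W = np.zeros((V, V))                      # logits for next char given current
for step in range(200):                   # "read" the context via SGD
    logits = W[xs]                        # (T, V)
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    loss = -np.log(p[np.arange(len(ys)), ys]).mean()
    grad = p
    grad[np.arange(len(ys)), ys] -= 1.0   # d(loss)/d(logits) for softmax CE
    np.add.at(W, xs, -0.5 * grad / len(ys))
    if step % 100 == 0:
        print(f"step {step}: next-token loss {loss:.3f}")
# The loss falls as the context is compressed into W instead of being re-read.
```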
106dTutorial#trainingby Yu Sun