$ timeahead_
★ TOP STORY · [ATA] · Hardware · 2d ago

US accuses China of “industrial-scale” AI theft. China says it’s “slander.”

The US is preparing to crack down on China’s allegedly “industrial-scale theft of American artificial intelligence labs’ intellectual property,” the Financial Times reported Thursday. Since the launch of DeepSeek—a Chinese model that OpenAI claimed was trained using outputs from its models—other AI firms have accused global rivals of using a method called distillation to steal their IP. In January, Google claimed that “commercially motivated” actors not limited to China attempted to clone its Gemini AI chatbot by prompting the model more than 100,000 times in bids to train cheaper copycats. The next month, Anthropic accused Chinese firms DeepSeek, Moonshot, and MiniMax of using the same tactic to generate “over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts.” Also in February, OpenAI confirmed that most attacks it saw originated from China. For the US, these distillation attacks supposedly threaten…

Ars Technica AI · read →
▲ trending · last 48h · view all →
[ATA] Ars Technica AI · 2 articles · visit →
3d ago
Google unveils two new TPUs designed for the "agentic era"
Most of the companies that have fully committed to building AI models are gobbling up every Nvidia AI accelerator they can get, but Google has taken a different approach. Most of its cloud AI infrastructure is based on its line of custom Tensor Processing Units (TPUs). After announcing the seventh-gen Ironwood TPU in 2025, the company has moved on to the eighth-gen version, but it’s not just a faster iteration of the same chip. The new TPUs come in two flavors, providing Google and its customers with an AI platform that is faster and more efficient, the company says. Google is pushing the idea that the “agentic era” is fundamentally different from the AI systems that came before, necessitating a new approach to the hardware. So engineers have devised the TPU 8t (for training) and the TPU 8i (for inference). Before…
3d · Hardware · #agents #inference #training · by Ryan Whitwam
4d ago
Anthropic gets $5B investment from Amazon, will use it to buy Amazon chips
Amazon has significantly boosted its multibillion-dollar bet on Claude developer Anthropic by investing an additional $5 billion—enabling Anthropic to eventually secure up to 5 gigawatts’ worth of AI chips from Amazon to help train and run its popular Claude AI models. Amazon is already one of Anthropic’s largest investors, having previously invested $8 billion in the AI startup. The latest move brings Amazon’s immediate investment up to $13 billion, and the companies have agreed to the possibility of Amazon committing another $20 billion in the future if the partnership achieves certain commercial milestones, according to Wall Street Journal reporting. The large cash infusion and prospect of obtaining more computing resources come at a crucial time for Anthropic, given the massive surge in paid subscriptions for Claude-related services early this year. That demand spike and strain on the existing cloud compute…
4d · Hardware · #claude · by Jeremy Hsu
[AWS] AWS Machine Learning Blog · 1 article · visit →
5d ago
Accelerate Generative AI Inference on Amazon SageMaker AI with G7e Instances
As the demand for generative AI continues to grow, developers and enterprises seek more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we are thrilled to announce the availability of G7e instances powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs on Amazon SageMaker AI. You can provision nodes with 1, 2, 4, or 8 RTX PRO 6000 GPUs, with each GPU providing 96 GB of GDDR7 memory. This launch provides the capability to use a single-GPU G7e.2xlarge instance to host powerful open source foundation models (FMs) like GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B, offering organizations a cost-effective and high-performing option. This makes it well suited for those looking to reduce costs while maintaining high performance for inference workloads. The key highlights…
[FAB] Fireworks AI Blog · 5 articles · visit →
22d ago
4/3/2026 · Scaling and Optimizing Frontier Model Training
How Fireworks scales frontier model training and offers the broadest set of fine-tunable MoE models on any platform. Training trillion-parameter Mixture-of-Experts (MoE) models has historically been bottlenecked by memory walls and complex cluster orchestration. Earlier this month, Cursor released Composer 2 — a frontier coding model that tops CursorBench at 61.3, SWE-bench Multilingual at 73.7, and Terminal-Bench at 61.7. Fireworks powers the Reinforcement Learning (RL) inference infrastructure behind it, proving that these bottlenecks can be overcome at scale. We have written about delta-compressed weight sync and multi-region rollout fleets, and about why numerical parity between training and inference is especially hard for MoE models. Those posts cover the inference half of the RL loop — rollouts, weight transfer, and numerical alignment. This post covers the last missing piece: the trainer itself. Our Training SDK provides the model…
187d ago
10/20/2025 · Fireworks and AMD partner to power the next gen of AI infrastructure on AMD Instinct™ GPUs
Fireworks and AMD have entered into a multi-year strategic agreement to optimize AMD Instinct™ GPUs and accelerate adoption across AI-native companies, developers, and enterprises. We’re excited to share this new chapter in Fireworks’ mission to power the next generation of AI inference workloads. Our collaboration brings together AMD’s leadership in high-performance computing and Fireworks’ advanced AI stack to deliver scalable, production-grade AI systems that run inference faster, with the best quality, at the most efficient cost. For every organization and workload, there is a sweet spot where price, performance, and speed meet a technical and business outcome. By partnering with AMD, Fireworks provides best-in-class optimization technology alongside AMD Instinct™ GPUs. From model-serving runtimes to training frameworks, Fireworks is working closely with AMD to optimize every layer of our software stack for AMD Instinct™ MI325X and MI355X accelerators. Tuning the Fireworks…
313d ago
6/16/2025 · Build for Scale with Fireworks Virtual Cloud (GA)
Anyone who has run a production application at scale knows the impact that performance and reliability have on product success. For AI applications, the challenge is often to successfully operate a fleet of GPUs that handles scaled, globally distributed traffic, potentially in the midst of unprecedented growth. A few factors make managing bare-metal GPU deployments on your own difficult, and ultimately they distract your team from what matters: building winning product experiences for users. That’s why today we’re excited to announce the GA of the Fireworks Virtual Cloud, a platform that abstracts away the complexity of managing GPU deployments, handling hardware failures, and scaling workloads across a global fleet. Launching with over 18 global regions across 8 cloud providers, including support for BYOC, Fireworks Virtual Cloud lets you build for scale from Day 1. To get started with Fireworks Virtual Cloud,…
313d · Hardware · #inference
410d ago
Product · 11/3/2025 · 40X Faster and Smarter Outputs: How Vercel Turbocharged Their Code-Fixing Model with Open Models, Speculative Decoding, and Reinforcement Fine-Tuning on Fireworks
Vercel, a leading platform provider for full-stack web applications, partnered with Fireworks to solve a critical challenge for their AI code generation tool, v0: maximizing both output quality and inference speed at scale. The solution involved optimizing v0’s auto-fixer solution for customized workloads. By implementing advanced techniques, including Reinforcement Fine-Tuning (RFT) and speculative decoding, Fireworks delivered a massive step-change in performance and quality for Vercel. The result is a platform capable of achieving a 93% error-free generation rate and a 40X improvement in end-to-end latency for v0 users, setting a new benchmark for developer-facing AI tooling. Vercel's v0 composite model family is a specialized AI architecture designed to generate high-quality, error-free code for building fast, full-stack web applications. It's a powerful tool for developers because it addresses the limitations of other models by combining retrieval-augmented generation (RAG) for specialized knowledge,…
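The acceptance loop behind speculative decoding can be sketched in a few lines. This is an illustrative toy, not Vercel’s or Fireworks’ implementation: the draft_next and target_next functions below are hypothetical stand-ins for a cheap draft model and the full target model, and acceptance here is exact-match greedy verification.

```python
# Toy sketch of greedy speculative decoding. A cheap "draft" model proposes
# k tokens; the "target" model verifies them and keeps the longest matching
# prefix, then appends one token of its own, so output matches target-only
# greedy decoding while skipping ahead whenever the draft guesses right.

def draft_next(ctx):
    # Hypothetical draft model: fast but approximate next-token rule.
    return (sum(ctx) + 1) % 5

def target_next(ctx):
    # Hypothetical target model: the model whose output must be reproduced.
    return (sum(ctx) + 1) % 5 if len(ctx) % 4 else (sum(ctx) + 2) % 5

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies each proposed position (one batched pass in practice).
        accepted, ctx = [], list(out)
        for t in proposal:
            if target_next(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                break
        # 3) Keep the accepted prefix, then one guaranteed token from the target.
        out += accepted
        out.append(target_next(out))
    return out[len(prompt):][:n_tokens]

print(speculative_decode([1, 2, 3], 8))
```

The key property, preserved even in this toy, is that the output is bit-identical to decoding with the target model alone; speculation only changes how many target calls are needed.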
449d ago
1/31/2025 · Distillation with Reasoning: Can DeepSeek R1 Teach Better Than Humans?
The recent release of DeepSeek R1 has taken the AI community by storm, offering performance on par with leading frontier models—such as OpenAI’s o1—at a fraction of the cost. Still, R1 can be expensive for use cases with high traffic or low latency requirements. DeepSeek R1’s strength lies in its explicit step-by-step reasoning. Before generating a final answer, it creates an internal “chain of thought” (CoT) to systematically reason through each problem. This process is a form of test-time computation, allowing the model to dynamically allocate more compute to complex problems. However, these extended reasoning sequences typically increase inference cost. Distillation is a method for transferring knowledge from a large, more powerful teacher model to a smaller, more cost-effective student model. According to the DeepSeek R1 paper, R1 is highly effective in this teacher role. Its detailed CoT sequences guide…
449d · Hardware
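The teacher-to-student transfer described above can be illustrated with a minimal logit-distillation sketch. This is a generic illustration, not the DeepSeek R1 recipe from the post (which distills from generated CoT text); the logit values and temperature below are hypothetical.

```python
import math

# Minimal sketch of logit distillation: the student is trained to match the
# teacher's temperature-softened output distribution via cross-entropy.

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T makes the teacher's targets softer.
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # distribution; minimized exactly when the two distributions match.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]   # hypothetical teacher logits for one token
student = [1.5, 1.2, 0.3]   # hypothetical student logits for the same token
print(kd_loss(teacher, student))
```

In practice the gradient of this loss updates the student; the teacher is frozen and only supplies targets.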
[GB] Groq Blog · 2 articles · visit →
145d ago
Groq Recognized in 2025 Gartner® Cool Vendor in AI Infrastructure report
The next era of AI is here, one defined by fast, intelligent inference that scales as far as the world needs. Groq has been recognized as a 2025 Gartner Cool Vendor in AI Infrastructure. We believe this demonstrates the unique advantages LPUs deliver for real-time AI systems compared to traditional GPU architectures. The Gartner Cool Vendors report highlights innovative infrastructure vendors that enable heads of infrastructure & operations to deploy AI more rapidly, optimize costs, and mitigate risks, resulting in more effective and future-ready AI initiatives. More than 2.5M developers choose Groq for performance that’s up to 5x faster and lower cost than GPU-based alternatives. This capability stems from the Groq LPU, a chip purpose-built for low-latency inference, which we deliver to developers worldwide with GroqCloud. Compared to GPU-based…
145d · Hardware · #inference
333d ago
From Speed to Scale: How Groq Is Optimized for MoE & Other Large Models
You know Groq runs small models. But did you know we run large models, including MoE models, uniquely well? Here’s why. The Evolution of Advanced Openly-Available LLMs: There’s no argument that artificial intelligence (AI) has exploded, in part because of the advancements in large language models (LLMs). These models have shown some amazing capabilities when it comes to natural language processing, from text generation to complex reasoning. As LLMs become even more sophisticated, one of the biggest challenges is scaling them efficiently. That’s where Groq comes in: a company at the forefront of AI hardware innovation, addressing this challenge with its groundbreaking LPU. In the past few years, the AI community has seen a surge in open-source LLMs, including models like Llama, DeepSeek, and Qwen. These models…
333d · Hardware · #inference
[HF] Hugging Face Blog · 22 articles · visit →
16d ago
Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs
What is Waypoint-1.5? Waypoint-1.5 is Overworld’s next real-time video world model, built to bring interactive generative worlds to the hardware people actually own. The first release of Waypoint showed that real-time generative worlds were possible. It proved that interactive world models could be more than passive video demos, and that locally runnable systems could begin to close the gap between generating a world and actually stepping into one. Waypoint-1.5 builds directly on that foundation. This release improves visual fidelity, expands the range of hardware that can run the model locally, and takes another step toward interactive world simulation without datacenter-scale compute. On desktop hardware including RTX 3090 through 5090, Waypoint-1.5 can generate real-time environments at up to 720p and 60 FPS. This release also introduces a 360p tier designed to run…
16d · Hardware
25d ago
Training mRNA Language Models Across 25 Species for $165
Part II: Building the Pipeline, From Structure Prediction to Codon Optimization. By OpenMed, Open-Source Agentic AI for Healthcare & Life Sciences. TL;DR: We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below. Imagine going from…
29d ago
Liberate your OpenClaw
If you’ve been cut off and your OpenClaw, Pi, or Open Code agents need resuscitation, you can move them to open models in two ways: - Use an open model served through Hugging Face Inference Providers. - Run a fully local open model on your own hardware. The hosted route is the fastest way back to a capable agent. The local route is the right fit if you want privacy, zero API costs, and full control. To do so, just tell Claude Code, Cursor, or your favorite agent: help me move my OpenClaw agents to Hugging Face models, and link this page. Hugging Face Inference Providers is an open platform that routes to providers of open source models. It’s the right choice if you want the best models or you…
53d ago
PRX Part 3 — Training a Text-to-Image Model in 24h!
Introduction: Welcome back 👋 In the last two posts (Part 1 and Part 2), we explored a wide range of architectural and training tricks for diffusion models. We tried to evaluate each idea in isolation, measuring throughput, convergence speed, and final image quality, and tried to understand what actually moves the needle. In this post, we want to answer a much more practical question: What happens when we combine all the tricks that worked? Instead of optimizing one dimension at a time, we’ll stack the most promising ingredients together and see how far we can push performance under a strict compute budget. To make things concrete, we’re doing a 24-hour speedrun: - 32 H200 GPUs - ~$1500 total compute budget ($2/hour/GPU) This is very far from the early diffusion days, where…
58d ago
Mixture of Experts (MoEs) in Transformers
Introduction: Over the past few years, scaling dense language models has driven most progress in LLMs. From early models like the original ULMFiT (~30M parameters) or GPT-2 (1.5B parameters, which at the time was considered "too dangerous to release" 🧌) to today’s hundred-billion-parameter systems, the recipe was simple: more data + more parameters gives better performance. Scaling laws reinforced this trend, but dense scaling has practical limits: - Training becomes increasingly expensive. - Inference latency grows. - Deployment requires significant memory and hardware. This is where Mixture of Experts (MoEs) enter the picture. If you're already familiar with MoEs and want to jump straight into the engineering work done in transformers, you can head directly to Transformers and MoEs. From Dense to Sparse: What Are MoEs? A Mixture of Experts model keeps…
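The dense-to-sparse idea can be sketched with a toy top-k router. This is illustrative only: real routers are learned linear layers over hidden states and the experts are feed-forward networks; the scalar "experts" and router scores below are hypothetical.

```python
# Minimal sketch of top-k MoE routing: only k of the experts run per token,
# and their outputs are combined with normalized router weights.

def route(scores, k=2):
    # Pick the k highest-scoring experts; normalize their scores into weights.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]

def moe_forward(x, experts, scores, k=2):
    # Sparse forward pass: unselected experts are never evaluated.
    return sum(w * experts[i](x) for i, w in route(scores, k))

# Hypothetical experts (scalar functions standing in for FFNs) and router scores.
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 0.4, 0.2, 0.3]
print(moe_forward(10.0, experts, scores, k=2))
```

The compute saving is the point: with k=2 of 4 experts active, half the expert parameters are touched per token, while total capacity stays large.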
75d ago
Transformers.js v4: Now Available on NPM!
npm i @huggingface/transformers. Performance & Runtime Improvements: The biggest change is undoubtedly the adoption of a new WebGPU runtime, completely rewritten in C++. We've worked closely with the ONNX Runtime team to thoroughly test this runtime across our ~200 supported model architectures, as well as many new v4-exclusive architectures. In addition to better operator support (for performance, accuracy, and coverage), this new WebGPU runtime allows the same transformers.js code to be used across a wide variety of JavaScript environments, including browsers, server-side runtimes, and desktop applications. That's right, you can now run WebGPU-accelerated models directly in Node, Bun, and Deno! We've proven that it's possible to run state-of-the-art AI models 100% locally in the browser, and now we're focused on performance: making these models run as fast as possible, even in resource-constrained environments. This…
75d · Hardware · #rag #coding
87d ago
We Got Claude to Build CUDA Kernels and Teach Open Models!
We got Claude to teach open models how to write CUDA kernels! - You can take Opus 4.5 or other SOTA models and tackle the hardest problems out there. - You can take models that run on your laptop and upskill them to harder problems. In this blog post, we’ll show you how to take on the latter. This blog post walks through the process of using a new tool, upskill, to generate and evaluate agent skills with large models and use them with smaller models. We will benchmark upskill on the task of writing CUDA kernels for diffusers models, but the process is generally useful for cutting costs, or using smaller models on hard and domain-specific problems. What are agent skills? In case you missed it, agent skills are taking the coding agent game by storm. In fact,…
87d · Hardware · #claude #gpu
95d ago
Differential Transformer V2
We compare DIFF V2 with DIFF V1 below. (For simplicity, we omit the batch dimension and assume that both the input and output of flash_attn_func are three-dimensional tensors (tokens, heads, head dimension); heads belonging to the same GQA group are arranged contiguously in the output.) Note: DIFF V2 subtracts two heads that are in the same GQA group, which means they share the same key and value. This is crucial to performance. See the design ablations section and GitHub code.
def DiffAttnV1(
    layer_index,
    q1, q2, k1, k2, v,
    lam_q1, lam_k1, lam_q2, lam_k2,
):
    """
    q1, q2: (N, h/2, d)
    k1, k2: (N, h_kv/2, d)
    v: (N, h_kv/2, 2d)
    lam_*: (d,)
    """
    attn1 = flash_attn_func(q1, k1, v)
    attn2 = flash_attn_func(q2, k2, v)
    lam_init = 0.8 - 0.6 * exp(-0.3…
95d · Hardware · #coding
180d ago
Streaming datasets: 100x More Efficient
TLDR: We boosted load_dataset('dataset', streaming=True), streaming datasets without downloading them, with one line of code! Start training on multi-TB datasets immediately, without complex setups, long downloads, "disk out of space" failures, or 429 "stop requesting!" errors. It's super fast, outrunning our local SSDs when training on 64xH100 with 256 workers downloading data. We've improved streaming to make 100x fewer requests, resolve data 10x faster, double samples/sec, and hit 0 worker crashes at 256 concurrent workers. Loading data, especially at the terabyte scale, is a major pain in any machine learning workflow. We suffered this while training SmolLM3; at one point we had to wait 3 hours before each run to download enough data. Streaming has always been possible in the datasets library, but large-scale training with massive datasets remained a challenge. That changes today…
183d ago
LeRobot v0.4.0: Supercharging OSS Robot Learning
TL;DR: LeRobot v0.4.0 delivers a major upgrade for open-source robotics, introducing scalable Datasets v3.0, powerful new VLA models like PI0.5 and GR00T N1.5, and a new plugin system for easier hardware integration. The release also adds support for LIBERO and Meta-World simulations, simplified multi-GPU training, and a new Hugging Face Robot Learning Course. Datasets: Ready for the Next Wave of Large-Scale Robot Learning. We've completely overhauled our dataset…
192d ago
Get your VLM running in 3 simple steps on Intel CPUs
Running AI models on your own device can be difficult, as these models are often computationally demanding, but it also offers significant benefits: improved privacy, since your data stays on your machine, and enhanced speed and reliability, because you're not dependent on an internet connection or external servers. This is where tools like Optimum Intel and OpenVINO come in, along with a small, efficient model like SmolVLM. In this blog post, we'll walk you through three easy steps to get a VLM running locally, with no expensive hardware or GPUs required (though you can run all the code samples from this blog post on Intel GPUs). Deploy your model with Optimum: Small models like SmolVLM are built for low-resource consumption, but they can be further optimized. In this…
192d · Hardware · #coding #local
205d ago
SOTA OCR with Core ML and dots.ocr
Enter the Neural Engine, Apple's custom AI accelerator that has shipped with every Apple device since 2017. This accelerator is designed for high performance whilst sipping battery power. Some of our testing has found the Neural Engine to be 12x more power efficient than CPU, and 4x more power efficient than GPU. Whilst this all sounds very appealing, unfortunately the Neural Engine is only accessible through Core ML, Apple's closed source ML framework. Furthermore, even just converting a model from PyTorch to Core ML can present some challenges, and without a preconverted model or some knowledge of the sharp edges it can be arduous for developers. Luckily, Apple also offers MLX, a more modern and flexible ML framework that targets the GPU (not the Neural Engine), and can be used in conjunction with…
205d · Hardware · #coding
211d ago
Swift Transformers Reaches 1.0 – and Looks to the Future
We released swift-transformers two years ago (!) with the goal of supporting Apple developers and helping them integrate local LLMs in their apps. A lot has changed since then (MLX and chat templates did not exist!), and we’ve learned how the community is actually using the library. We want to double down on the use cases that provide the most benefit to the community, and lay out the foundations for the future. Spoiler alert: after this release, we’ll focus a lot on MLX and agentic use cases 🚀 What is swift-transformers? swift-transformers is a Swift library that aims to reduce the friction for developers who want to work with local models on Apple Silicon platforms, including iPhones. It includes the missing pieces that are not provided by Core ML or MLX alone, but…
235d ago
Make your ZeroGPU Spaces go brrr with ahead-of-time compilation
This is where PyTorch ahead-of-time (AoT) compilation comes in. Instead of compiling models on the fly (which doesn’t play nicely with ZeroGPU’s short-lived processes), AoT lets you optimize once and reload instantly. The result: snappier demos and a smoother experience, with speedups ranging from 1.3×–1.8× on models like Flux, Wan, and LTX 🔥 In this post, we’ll show how to wire up Ahead-of-Time (AoT) compilation in ZeroGPU Spaces. We'll explore advanced tricks like FP8 quantization and dynamic shapes, and share working demos you can try right away. If you cannot wait, we invite you to check out some ZeroGPU-powered demos on the zerogpu-aoti organization. Pro users and Team / Enterprise org members can create ZeroGPU Spaces, while anyone can freely use them (Pro, Team and Enterprise users get 8x more ZeroGPU…
235d · Hardware
255d ago
Arm & ExecuTorch 0.7: Bringing Generative AI to the masses
With Arm’s recent SME2 announcement, the role of Arm KleidiAI is increasingly clear as Arm’s AI accelerator layer powering the next wave of AI. By embedding into widely-used Edge AI frameworks like XNNPack, MediaPipe, MNN, ONNX Runtime, and even llama.cpp, KleidiAI has delivered substantial performance improvements with no code changes required by developers. That foundation leads directly to the upcoming ExecuTorch 0.7 beta, where KleidiAI will be enabled by default—bringing automatic acceleration to devices built on the latest Arm CPU architecture, as well as a vast base of existing phones built on earlier generations. Android and cross-platform developers—whether first- or third-party—gain instant access to KleidiAI AI performance optimizations via ExecuTorch and XNNPack. The result? Faster model startups, lower latency, leaner memory footprints—and no integration hurdles. What previously required custom tuning…
310d ago
(LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware
In our previous post, Exploring Quantization Backends in Diffusers, we dived into how various quantization techniques can shrink diffusion models like FLUX.1-dev, making them significantly more accessible for inference without drastically compromising performance. We saw how bitsandbytes , torchao , and others reduce memory footprints for generating images. Performing inference is cool, but to make these models truly our own, we also need to be able to fine-tune them. Therefore, in this post, we tackle efficient fine-tuning of these models with peak memory use under ~10 GB of VRAM on a single GPU. This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the diffusers library. We'll showcase results from an NVIDIA RTX 4090. We'll also highlight how FP8 training with torchao can further optimize speed on compatible hardware. Table of Contents -…
310d · Hardware · #fine-tuning
317d ago
How Long Prompts Block Other Requests - Optimizing LLM Performance
The Simpler Challenge: Long Prompts Block the Queue. Since individual decode steps are not compute-intensive, one can increase throughput by batching decodes of multiple requests. For prefill, however, this approach does not work. Because of the parallelized processing of all prompt tokens, a single prefill step can already saturate GPU utilization. Consequently, in the default chunked-prefill strategy of vLLM, each prefill chunk contains only prompt tokens of a single request. The next request in line has to wait until the previous prefill phase has been finished before its own prefill phase can start. This sequential scheduling of prefill chunks for different requests poses a challenge: whenever a request with a very long prompt is scheduled for prefill, any subsequent request has to wait for the duration of the long prefill…
317d · Hardware · #inference #coding
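The head-of-line blocking described above can be illustrated with a back-of-the-envelope simulation. The prefill durations below are hypothetical, and the strictly sequential scheduler is a simplification of the chunked-prefill behavior the post describes.

```python
# Sketch of the queueing effect: if prefill phases of different requests cannot
# interleave, one long prompt delays every request scheduled behind it.
# Times are hypothetical, in seconds.

def sequential_prefill_waits(prefill_times):
    # Each request's wait is the total prefill time of everything before it.
    waits, elapsed = [], 0.0
    for t in prefill_times:
        waits.append(elapsed)
        elapsed += t
    return waits

short_only = sequential_prefill_waits([0.1, 0.1, 0.1])  # three short prompts
long_first = sequential_prefill_waits([8.0, 0.1, 0.1])  # long prompt arrives first
print(short_only, long_first)
```

With only short prompts, everyone starts almost immediately; put one 8-second prefill at the front and every later request inherits that full delay before its own prefill can begin.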
317d ago
Featherless AI on Hugging Face Inference Providers 🔥
We're thrilled to share that Featherless AI is now a supported Inference Provider on the Hugging Face Hub! Featherless AI joins our growing ecosystem, enhancing the breadth and capabilities of serverless inference directly on the Hub’s model pages. Inference Providers are also seamlessly integrated into our client SDKs (for both JS and Python), making it super easy to use a wide variety of models with your preferred providers. Featherless AI supports a wide variety of text and conversational models, including the latest open-source models from DeepSeek, Meta, Google, Qwen, and much more. Featherless AI is a serverless AI inference provider with unique model loading and GPU orchestration abilities that makes an exceptionally large catalog of models available for users. Providers often offer either a low cost of access to a limited set…
326d ago
No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL
🚀 Introduction TRL supports training LLMs using GRPO, an online learning algorithm recently introduced in the DeepSeekMath paper. In GRPO, the model learns from its own outputs: it generates responses during training, receives feedback, and uses that feedback to improve itself over time. This makes generation a critical step in the training loop — and also a major bottleneck. To speed up generation, TRL integrates with vLLM. This combination lets you train powerful models more efficiently in GRPO setup. However, there’s a catch. 🧨 The Problem Before TRL v0.18.0, vLLM was only supported in server mode, running as a separate process on different GPUs from the training job. It communicated with the training script over HTTP, which made the setup modular and easy to use — but also introduced…
326d · Hardware · #inference
374d ago
Prefill and Decode for Concurrent Requests - Optimizing LLM Performance
At TNG, we are self-hosting numerous Large Language Models on our cluster of 24 H100 GPUs. It supports 50 different applications, handles over 5,000 inferences per hour, and generates more than ten million tokens every day. The Two Stages of Token Generation: Prefill and Decode Most LLMs generate text token by token, which guarantees that every new token is computed based on all preceding tokens (this model property is called auto-regressive). The first output token depends on all prompt tokens, but the second output token already depends on all prompt tokens plus the first output token, and so on. As a consequence, token generation cannot be parallelized at the level of an individual request. In LLMs with attention mechanisms, computing a new token requires calculating key, value, and query vectors…
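The prefill/decode split and the sequential nature of decoding can be sketched with a toy loop. Everything below is an illustrative stand-in (encode and next_token are hypothetical functions, not real attention): prefill processes the whole prompt at once, while each decode step depends on the cached state of all previous tokens.

```python
# Toy sketch of the two stages of token generation described above.

def encode(tokens):
    # Stand-in for computing key/value state for a batch of tokens.
    return [t * 2 for t in tokens]

def next_token(kv_cache):
    # Stand-in for one decode step: depends on ALL preceding tokens' state,
    # which is why decoding is auto-regressive and cannot be parallelized
    # within a single request.
    return sum(kv_cache) % 7

def generate(prompt, n_new):
    kv = encode(prompt)        # prefill: parallel over all prompt tokens
    out = []
    for _ in range(n_new):     # decode: inherently sequential
        t = next_token(kv)
        out.append(t)
        kv += encode([t])      # append the new token's state to the cache
    return out

print(generate([3, 1, 4], 5))
```

Note the asymmetry the article exploits: encode runs once over many tokens, while next_token must run once per generated token in order.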
388d ago
Efficient Request Queueing – Optimizing LLM Performance
Starting Point: A Bare Inference Engine. An inference engine like vLLM or Hugging Face TGI consists of:
- a worker that does the actual work of calculating the next token in a request
- a queue to which requests are added when they first arrive
- a scheduler that takes requests from the queue and moves them to the worker
Why do we need a queue here? Because calculations on the GPU are more performant and resource-efficient when they are done batch-wise instead of isolated for individual requests. This backend queue allows the scheduler to pick multiple requests and put them in the same batch to be processed. Note that typically each inference engine serves only a single model, and we have multiple deployments running for different models in parallel. Problem: "Power Users" Can…
388dHardware#inference
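The queue/scheduler/worker split described above can be sketched in a few lines of plain Python; `TinyScheduler` and its greedy batching are illustrative stand-ins, not vLLM's or TGI's actual scheduler:

```python
from collections import deque

class TinyScheduler:
    # Minimal sketch of the split: requests land in a queue, the
    # scheduler drains up to `batch_size` of them, and the worker
    # processes the whole batch in one step.
    def __init__(self, batch_size):
        self.queue = deque()
        self.batch_size = batch_size

    def submit(self, request):
        self.queue.append(request)

    def step(self, worker):
        n = min(self.batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(n)]
        return worker(batch) if batch else []

sched = TinyScheduler(batch_size=4)
for prompt in ["a", "b", "c", "d", "e", "f"]:
    sched.submit(prompt)

# Batched "worker": one GPU step handles several requests at once.
first = sched.step(lambda batch: [p.upper() for p in batch])
second = sched.step(lambda batch: [p.upper() for p in batch])
print(first, second)
```

The queue is what lets the scheduler amortize one expensive GPU step over several requests instead of running them in isolation.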
487d ago
Visualize and understand GPU memory in PyTorch
Visualize and understand GPU memory in PyTorch RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.93 GiB total capacity; 6.00 GiB already allocated; 14.88 MiB free; 6.00 GiB reserved in total by PyTorch) While it's easy to see that GPU memory is full, understanding why and how to fix it can be more challenging. In this tutorial, we'll go step by step on how to visualize and understand GPU memory usage in PyTorch during training. We’ll also see how to estimate memory requirements and optimize GPU memory usage. 🔎 The PyTorch visualizer PyTorch provides a handy tool for visualizing GPU memory usage:

import torch
from torch import nn

# Start recording memory snapshot history
torch.cuda.memory._record_memory_history(max_entries=100000)

model = nn.Linear(10_000, 50_000, device="cuda")
for _ in range(3):
    inputs = torch.randn(5_000, 10_000, device="cuda")
    outputs = model(inputs)

# Dump memory…
487dHardware
[NV]NVIDIA Developer Blog· 34 articlesvisit →
3d ago
Scaling the AI-Ready Data Center with NVIDIA RTX PRO 4500 Blackwell Server Edition and NVIDIA vGPU 20
AI integration is redefining mainstream enterprise applications, from productivity software like Microsoft Office to more complex design and engineering tools. This shift requires the modern data center to move beyond single-purpose silos. For developers, gaining access to dedicated GPU compute can often be a bottleneck. Virtual machines (VMs) solve part of this challenge by providing secure, isolated, and scalable environments tailored to specific project needs. However, dedicating an entire physical GPU to a single VM is highly inefficient for mixed or lightweight workloads. This is where NVIDIA Multi-Instance GPU (MIG) technology becomes essential. With MIG, a single physical GPU is partitioned at the hardware level into multiple fully independent instances, each with guaranteed memory, cache, and compute cores. For a development team, this ensures predictable, uncompromising Quality of Service (QoS). This means that multiple developers can simultaneously train AI models,…
3dHardware#gpuby Phoebe Lee
11d ago
NVIDIA NVbandwidth: Your Essential Tool for Measuring GPU Interconnect and Memory Performance
When you’re writing CUDA applications, one of the most important things to focus on is data transfer performance. This applies to single-GPU and multi-GPU systems alike. One of the tools you can use to understand the memory characteristics of your GPU system is NVIDIA NVbandwidth. In this blog post, we’ll explore what NVbandwidth is, how it works, its key features, and how you can use it to test and evaluate your own NVIDIA GPU systems. This post is intended for CUDA developers, system architects, and ML infrastructure engineers who need to measure and validate GPU interconnect performance. What is NVbandwidth? NVbandwidth is a CUDA-based tool that measures bandwidth and latency for various memory copy patterns across different links using either copy engine (CE) or kernel copy methods. It reports the current measured bandwidth…
11dHardware#coding#gpuby Eva Sitaridi
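The arithmetic behind any bandwidth test is simply bytes moved divided by elapsed time. A host-memory sketch of that measurement in plain Python (no GPU; NVbandwidth itself times GPU copy engines and kernel copies over real interconnects):

```python
import time

def measure_copy_bandwidth(n_bytes, repeats=5):
    # Toy host-RAM analogue of a bandwidth test: time a memory
    # copy several times and report bytes per second for the best run.
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # the timed copy
        elapsed = time.perf_counter() - t0
        assert len(dst) == n_bytes
        best = min(best, elapsed)
    return n_bytes / best  # bytes per second

bw = measure_copy_bandwidth(16 * 1024 * 1024)  # 16 MiB
print(f"{bw / 1e9:.2f} GB/s")
```

Taking the best of several repeats is the usual way to filter out scheduling noise, and the same bytes-over-seconds arithmetic applies whether the link is host RAM, NVLink, or PCIe.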
11d ago
NVIDIA Ising Introduces AI-Powered Workflows to Build Fault-Tolerant Quantum Systems
NVIDIA Ising is the world’s first family of open AI models for building quantum processors, launching with two model domains: Ising Calibration and Ising Decoding. Both target the fundamental challenge in quantum computing—qubits are inherently noisy. The best quantum processors make an error roughly once in every thousand operations. To become useful accelerators for scientific and enterprise problems, error rates must drop to one in a trillion or better. AI is the most promising path to closing that gap at scale. Calibration is the process of understanding the noise in each quantum processor and tuning it to achieve the best possible performance. Calibration minimizes error, but because of noise in quantum systems, errors must be corrected in real time by a classical computer, faster than they accumulate. This process is called quantum error correction decoding. Both calibration and decoding are…
11dHardware#agents#coding#gpuby Tom Lubowe
16d ago
Running Large-Scale GPU Workloads on Kubernetes with Slurm
Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations running large-scale AI training have years of investment in Slurm job scripts, fair-share policies, and accounting workflows. The challenge is getting Slurm scheduling capabilities onto Kubernetes—the standard platform for managing GPU infrastructure at scale—without maintaining two separate environments. Slinky, an open source project developed by SchedMD (now part of NVIDIA), takes two approaches to this integration: - slurm-bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as a Kubernetes scheduler for pods - slurm-operator runs full Slurm clusters on Kubernetes infrastructure, managing the complete lifecycle of Slurm daemons as pods This post focuses on the slurm-operator, which is how NVIDIA runs Slurm on Kubernetes for large-scale GPU training clusters. It walks through…
16dHardware#open-sourceby Anton Polyakov
18d ago
Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling
The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18 tightly coupled compute trays, massive GPU fabrics, and high-bandwidth networking packaged as a unit. For AI architects and HPC platform operators, the challenge isn’t just racking and stacking hardware—it’s turning infrastructure into safe, performant, and easy-to-use resources for end users. The mismatch between rack-scale hardware topology and scheduler abstractions is where most of the operational complexity lives. Left unaddressed, schedulers operate on a flat pool of GPUs and nodes, overlooking the system’s hierarchical and topology-sensitive design. This is the gap that a validated software stack, such as NVIDIA Mission Control, is designed to bridge. Mission Control provides rack-scale control planes for NVIDIA Grace Blackwell NVL72 systems. With a native understanding of NVIDIA NVLink and NVIDIA IMEX domains, it integrates with…
18dHardware#gpuby Ryan Prout
23d ago
Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU scheduling. In the previous post, Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6, this was described as the data-to-tensor gap—a performance mismatch between AI pipeline stages. The SMPTE VC-6 (ST 2117-1) codec addresses this gap through a hierarchical, tile-based architecture. Images are encoded as progressively refinable Levels of Quality (LoQs), each adding incremental detail. This enables selective retrieval and decoding of only the required resolution, region of interest, or color plane, with random access to independently decodable frames. Pipelines can retrieve and decode only what the model needs. However, efficient single-image execution does not automatically translate to efficient scaling. As batch sizes grow, the bottleneck shifts from single-image kernel efficiency to workload orchestration, launch cadence, and GPU occupancy.…
23dHardware#inference#multimodal#gpuby Andreas Kieslinger
24d ago
CUDA Tile Programming Now Available for BASIC!
Note: CUDA Tile Programming in BASIC is an April Fools’ joke, but it’s also real and actually works, demonstrating the flexibility of CUDA. CUDA 13.1 introduced CUDA Tile, a next generation tile-based GPU programming paradigm designed to make fine-grained parallelism more accessible and flexible. One of its key strengths is language openness: any programming language can target CUDA Tile, enabling developers to bring tile-based GPU acceleration into a wide range of ecosystems. In response to overwhelming demand from seasoned developers everywhere, we’re releasing cuTile BASIC for GPUs, bringing CUDA Tile programming to this long-overlooked language. What is cuTile BASIC? cuTile BASIC is an expression of the CUDA Tile programming model in BASIC, built on top of the CUDA Tile IR specification. It enables you to write tile kernels in BASIC using a tile-based model, which is a natural fit for…
24dHardware#coding#gpuby Rob Armstrong
24d ago
NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design
Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue. MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a wide range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, which is 9x the total of all other submitters combined. This round, the NVIDIA partner ecosystem participated broadly, with 14 partners—the largest number of partners submitting on any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud,…
24dHardware#inference#gpuby Ashraf Eassa
24d ago
Accelerate Token Production in AI Factories Using Unified Services and Real-Time AI
In today’s AI factory environment, performance is not theoretical. It is economic, competitive, and existential. A 1% drop in usable GPU time can mean millions of tokens lost per hour. Minutes of congestion can cascade into hours of recovery. A rack-level power oversubscription can lead to stranded power and reduced tokens per watt, silently eroding factory output at scale. As AI factories scale to thousands of GPUs running diverse mission critical workloads, the cost of unpredictable congestion, power constraints, long-tail latency, and limited visibility grows exponentially. Operations teams and administrators need more than dashboards. They need flexibility and foresight. NVIDIA launched NVIDIA Mission Control as an integrated software stack for AI factories built on NVIDIA reference architectures, codifying NVIDIA best practices with a unified control plane. Mission Control version 3.0 expands further, introducing architectural flexibility, multi-org isolation, intelligent power orchestration…
24dHardwareby Pradyumna Desale
25d ago
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Spatial computing is moving from visualization to active collaboration, adding increasingly more GPU demands on XR hardware to render photorealistic, physics-accurate, high-fidelity spatial content in real time. Meanwhile, developers have had to maintain separate codebases for every platform, each with different toolchains, SDKs, and streaming protocols. At NVIDIA GTC 2026, NVIDIA CloudXR 6.0 introduced a universal OpenXR-based streaming runtime that works across headsets, operating systems, and browsers—including native visionOS integration. This post walks through how the CloudXR 6.0 architecture works and how to start building today. CloudXR 6.0: Universal OpenXR streaming The release focuses on expanding the reach of NVIDIA RTX-powered content to any spatial display without the constraints of local hardware or manual device provisioning. Native spatial streaming for Apple platforms NVIDIA and Apple have collaborated to build a high-performance bridge for Apple Vision Pro using privacy-protected foveated streaming…
25dHardware#gpuby Max Bickley
31d ago
Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads
In production Kubernetes environments, the mismatch between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition (ASR) or text-to-speech (TTS) models may require only 10 GB of VRAM, yet occupy an entire GPU in standard Kubernetes deployments. Because the scheduler maps a model to one or more GPUs and can’t easily share GPUs across models, expensive compute resources often remain underutilized. Solving this isn’t just about cost reduction—it’s about optimizing cluster density to serve more concurrent users on the same world-class hardware. This guide details how to implement and benchmark GPU partitioning strategies, specifically NVIDIA Multi-Instance GPU (MIG) and time-slicing, to fully use compute resources. Using a production-grade voice AI pipeline as our testbed, we show how to combine models to maximize infrastructure ROI while maintaining >99% reliability and strict latency guarantees. Addressing GPU resource fragmentation By…
31dHardware#inferenceby Sagar Desai
31d ago
How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy
In the current state of automotive radar, machine learning engineers can’t work with camera-equivalent raw RGB images. Instead, they work with the output of radar constant false alarm rate (CFAR), which is similar to computer vision (CV) edge detections. The communications and compute architectures haven’t kept pace with trends in AI and the needs of Level 4 autonomy, despite radar being a staple of vehicle‑level sensing for years. The real 3D/4D “image” signal is instead processed inside the edge device. The radar outputs objects, or in some cases point clouds, which is similar to a camera outputting a classical CV Canny edge‑detection image. Centralized radar processing on NVIDIA DRIVE changes this model: Raw analog‑to‑digital converter (ADC) data moves into a centralized compute platform. From there, a software-defined pipeline accelerated by dedicated NVIDIA Programmable Vision Accelerator (PVA) hardware handles everything from…
31dHardware#gpuby Lachlan Dowling
33d ago
NVIDIA IGX Thor Powers Industrial, Medical, and Robotics Edge AI Applications
Industrial and medical systems are rapidly increasing the use of high-performance AI to improve worker productivity, human-machine interaction, and downtime management. From factory automation cells to autonomous mobile platforms to surgical rooms, operators are deploying increasingly complex generative AI models, more sensors, and higher‑fidelity data streams at the edge. Meanwhile, safety and regulatory compliance are crucial, making deterministic behavior, high availability, and verifiable functional safety essential design requirements. This post introduces NVIDIA IGX Thor, a platform built for the demands of powering industrial AI at the edge, including a deep dive into performance and safety features. What is NVIDIA IGX Thor? NVIDIA IGX Thor is an enterprise-ready platform for physical AI. It offers server‑class AI performance together with industrial-grade hardware, advanced functional safety capabilities, extended lifecycle support, and an enterprise software stack in configurations suitable for industrial and medical…
33dHardware#agents#gpu#safetyby Suhas Hariharapura Sheshadri
40d ago
Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform
NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of agentic systems. Co-designed with the NVIDIA Vera Rubin NVL72, LPX equips the AI factory with an engine optimized for fast, predictable token generation, while Vera Rubin NVL72 remains the flexible, general-purpose workhorse for training and inference, delivering high throughput across prefill and decode, including long-context processing, decode attention, and high-concurrency serving at scale. This combination matters because the agentic future demands a new category of inference. As generation speeds approach 1,000 tokens per second per user, models move beyond conversation-speed interaction toward speed of thought computing. At that rate, AI systems can reason, simulate, and respond continuously, enabling experiences that feel less like turn-based chat and more like real-time collaboration. This shift also raises the ceiling…
40dHardware#inference#gpuby Kyle Aubrey
44d ago
Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes
Every AI cluster running on Kubernetes requires a full software stack that works together, from low-level driver and kernel settings to high-level operator and workload configurations. You get one cluster working, and spend days getting the next one to match. Upgrade a component, and something else breaks. Move to a new cloud and start over. AI Cluster Runtime is a new open-source project designed to remove cluster configuration from the critical path. It publishes optimized, validated, and reproducible Kubernetes configurations as recipes you can deploy onto your clusters. How AI Cluster Runtime works To support GPU clusters across cloud and on-premises AI factories, NVIDIA validates specific combinations of drivers, runtimes, operators, kernel modules, and system settings for AI workloads. AI Cluster Runtime publishes those results as recipes. These version-locked YAML files capture which components were tested, the versions, and the…
44dHardwareby Mark Chmarny
47d ago
CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features
CUDA 13.2 arrives with a major update: NVIDIA CUDA Tile is now supported on devices of compute capability 8.X architectures (NVIDIA Ampere and NVIDIA Ada), as well as 10.X, 11.X and 12.X architectures (NVIDIA Blackwell). In an upcoming release of the CUDA Toolkit, all GPU architectures starting with Ampere will be fully supported. If you’re using Ampere, Ada, or Blackwell GPU architectures, check out the cuTile Python Quickstart guide to get started with CUDA Tile. This post explores the CUDA 13.2 release, which boosts developer productivity with a variety of new Python additions, including profiling in CUDA Python and debugging Numba kernels. The math libraries provide expanded support for high-performance emulated libraries, and CUDA Core Compute Libraries (CCCL) continue to add both performance and feature improvements, providing C++ developers with a high-performance, modern interface to GPU programming. cuTile Python cuTile…
47dHardware#local#gpuby Jonathan Bentz
47d ago
Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library
Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving, KV cache loading, and wide expert parallelism. In disaggregated serving environments, prefill and decode phases are run on separate GPUs, requiring efficient KV cache transfers between them. Low-latency, high-throughput communication to move these KV caches is critical to realizing the benefits of disaggregated serving. In KV cache loading, storage is used to help with growing KV caches in multiturn and agentic AI workloads such as coding assistants and reasoning. For long-context KV, previous results can be loaded from local SSDs or remote storage instead of being recomputed during prefill. This is one example that explains why storage…
47dHardware#inference#gpuby Seonghee Lee
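The "load instead of recompute" idea behind KV cache loading can be sketched with an ordinary memo cache: key the expensive prefix work by the prompt, and a follow-up turn with the same prefix skips it. A toy stdlib sketch (the cached hash is a stand-in for real KV tensors, which a serving stack would stream from SSD or remote storage):

```python
import functools

calls = {"prefill": 0}

@functools.lru_cache(maxsize=None)
def prefill_kv(prompt):
    # Stand-in for the expensive prefill pass that builds KV tensors.
    calls["prefill"] += 1
    return hash(prompt)

def answer(prompt):
    kv = prefill_kv(prompt)  # cache hit: loaded; miss: recomputed
    return kv

answer("long shared context")
answer("long shared context")  # second turn reuses the cached KV
print("prefill passes:", calls["prefill"])
```

In a multiturn or agentic workload the shared prefix can be most of the context, so skipping its recomputation is where the latency and throughput wins come from.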
51d ago
Controlling Floating-Point Determinism in NVIDIA CCCL
A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a simple property to guarantee, it can be difficult to achieve in practice, especially in parallel programming and floating-point arithmetic. This is because floating-point addition and multiplication aren’t strictly associative—that is, (a + b) + c may not equal a + (b + c)—due to rounding that occurs when intermediate results are stored with finite precision. With NVIDIA CUDA Core Compute Libraries (CCCL) 3.1, CUB—a low-level CUDA library for speed-of-light parallel device algorithms—added a new single-phase API that accepts an execution environment, enabling users to customize algorithm behavior. We can use this environment to configure the reduce algorithm’s determinism property. This can only be done through the new single-phase API, since the two-phase API doesn’t accept an…
51dHardware#coding#gpuby Nader Al Awar
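The non-associativity at the heart of this can be reproduced in any language with IEEE 754 doubles, Python included:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # round after each addition
right = a + (b + c)  # same operands, different grouping
print(left, right, left == right)

# A parallel reduction is free to regroup partial sums differently
# from run to run, so the bitwise result can change between runs:
print(sum([0.1] * 10) == 1.0)
```

Both groupings are "correct" to within rounding; determinism requires pinning the reduction order, which is exactly what the CCCL environment described above lets you configure.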
53d ago
cuTile.jl Brings NVIDIA CUDA Tile-Based Programming to Julia
NVIDIA CUDA Tile is one of the most significant additions to NVIDIA CUDA programming and unlocks automatic access to tensor cores and other specialized hardware. Earlier this year, NVIDIA released cuTile for Python, giving Python developers a natural way to write high-performance GPU kernels. Now, the same programming model is available in Julia through cuTile.jl. In this blog post, we’ll explore how cuTile.jl simplifies the development of high-performance CUDA kernels, demonstrate its idiomatic Julia syntax, and discuss its performance parity with the existing cuTile Python implementation. What is tile-based GPU programming? Traditional GPU programming with CUDA requires developers to think about threads, warps, and memory hierarchies. While powerful, this approach requires the programmer to map algorithms onto hardware efficiently. With CUDA Tile, developers describe operations on tiles of data, and the compiler handles the mapping to hardware. Consider vector addition.…
53dHardware#coding#gpuby Tim Besard
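The tile-based model described above, operating on whole tiles of data rather than individual threads, can be sketched for vector addition in plain Python (conceptual pseudocode only, not cuTile.jl or cuTile Python syntax):

```python
def tile_vector_add(a, b, tile=4):
    # Tile-style vector add: describe the work as operations on
    # whole tiles; here plain Python stands in for the compiler
    # that maps each tile onto hardware.
    assert len(a) == len(b)
    out = [0] * len(a)
    for start in range(0, len(a), tile):
        sl = slice(start, start + tile)  # one tile of data
        out[sl] = [x + y for x, y in zip(a[sl], b[sl])]
    return out

a = list(range(10))
b = [10 * x for x in a]
print(tile_vector_add(a, b))
```

The programmer's unit of thought is the tile (the slice), not the element; in CUDA Tile the compiler decides how each tile maps onto threads, warps, and tensor cores.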
57d ago
Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints
Alibaba has introduced the new open source Qwen3.5 series built for native multimodal agents. The first model in this series is a ~400B parameter native vision-language model (VLM) with reasoning built with a hybrid architecture of mixture of experts (MoE) and Gated Delta Networks. Qwen3.5 can understand and navigate user interfaces, which improves on the previous generation of VLMs. Qwen3.5 is ideal for a variety of use cases, including: - Coding, including web development - Visual reasoning, including mobile and web interfaces - Chat applications - Complex search Build with NVIDIA endpoints You can start building with Qwen3.5 today with free access to GPU-accelerated endpoints on build.nvidia.com, powered by NVIDIA Blackwell GPUs. As part of the NVIDIA Developer Program, you can explore quickly in the browser, experiment with prompts, and even test the model with your own data to evaluate…
57dHardware#qwen#fine-tuning#multimodal#open-sourceby Anu Srivastava
57d ago
Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM
Organizations deploying LLMs are challenged by inference workloads with different resource requirements. A small embedding model might use only a few gigabytes of GPU memory, while a 70B+ parameter LLM could require multiple GPUs. This diversity often leads to low average GPU utilization, high compute costs, and unpredictable latency. The problem isn’t just about packing more workloads onto GPUs but about scheduling them intelligently. Without orchestration that understands inference workload patterns, organizations face a choice between overprovisioning (wasting resources) and underprovisioning (degrading performance). This blog post covers: - The inference utilization problem: Why traditional scheduling underutilizes GPU resources. - How NVIDIA NIM delivers production inference: The role of containerized microservices in standardizing model deployment. - NVIDIA Run:ai’s intelligent scheduling strategies: Four key capabilities that enhance performance (lower latency, increase TPS/GPU) while increasing GPU utilization and reducing compute costs. - Benchmarking…
57dHardware#inference#embeddings#gpuby Shwetha Krishnamurthy
59d ago
Making Softmax More Efficient with NVIDIA Blackwell Ultra
LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI ”speed of thought” is increasingly governed not by the massive throughput of matrix multiplications, but by the transcendental math of the softmax function. Transcendentals refer to functions that cannot be expressed as the root of a polynomial equation with rational coefficients. Consequently, they “transcend” basic algebraic operations like addition and multiplication—the exact operations Tensor Cores excel at. In the specific context of softmax, the most computationally expensive of these transcendentals is the natural exponential function that is executed on Special Function Units (SFUs). In NVIDIA assembly instructions (SASS), this function is invoked via the MUFU.EX2 instruction. This architectural split creates a softmax bottleneck within the attention block, when powerful matrix engines are forced…
59dHardware#gpuby Jamie Li
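A numerically stable softmax makes the role of the exponential explicit. In this plain-Python sketch, e**x is deliberately written as 2**(x * log2(e)), mirroring how special function units expose a base-2 exponential (MUFU.EX2 computes 2**x, so e**x is derived from it):

```python
import math

def softmax(xs):
    # Stable softmax: subtract the max so no exponent overflows,
    # then compute e**x via the base-2 exponential, as GPU special
    # function units do.
    m = max(xs)
    exps = [2.0 ** ((x - m) * math.log2(math.e)) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

One exponential per score, per query, per attention head: that is the per-element transcendental work that starts to rival the matrix multiplies as context lengths grow.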
65d ago
Accelerating Data Processing with NVIDIA Multi-Instance GPU and Locality Domains
NVIDIA flagship data center GPUs in the NVIDIA Ampere, NVIDIA Hopper, and NVIDIA Blackwell families all feature non-uniform memory access (NUMA) behaviors, but expose a single memory space. Most programs therefore do not have an issue with memory non-uniformity. However, as bandwidth increases in newer generation GPUs, there are significant performance and power gains to be had when taking into consideration compute and data locality. This post first analyzes the memory hierarchy of the NVIDIA GPUs, discussing the power and performance impacts of data transfer over die-to-die link. It then reviews how to use NVIDIA Multi-Instance GPU (MIG) mode to achieve data localization. Finally, it presents results for running MIG mode versus unlocalized for the Wilson-Dslash stencil operator use case. Note: The techniques described in this post are exploratory, and the field is evolving quickly. New developments may supersede what…
65dHardware#gpuby Mukul Joshi
66d ago
Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai
As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is wholly delivered by NVIDIA Run:ai in any environment—cloud, NCP, and on-premises. This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to evaluate how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius’ AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and hyperscaler-grade performance and elasticity needed to deliver these gains at production scale. All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments. The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs: - 77% of full…
66dHardware#inference#gpuby Boskey Savla
66d ago
How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models
As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control. Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve its country’s diverse population, support nearly two-dozen languages, and keep model development and data governance fully under India’s sovereign control. To meet strict latency targets and improve inference efficiency for its flagship Sovereign 30B model, Sarvam AI collaborated with…
66dHardware#inference#coding#gpuby Utkarsh Uppal
85d ago
Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton
NVIDIA CUDA Tile is a GPU-based programming model that targets portability for NVIDIA Tensor Cores, unlocking peak GPU performance. One of the great things about CUDA Tile is that you can build your own DSL on top of it. This post shares the work NVIDIA is doing to integrate CUDA Tile as a backend for OpenAI Triton, an open source Python DSL designed for writing DL kernels for GPUs. OpenAI Triton supports tiled computation, a technique that divides data and computational tasks into small blocks. Triton contains an MLIR-based compiler that generates PTX. This enables researchers without CUDA experience to write efficient GPU code. What are CUDA Tile and CUDA Tile IR? CUDA Tile extends the CUDA programming model to enable first-class support for tile programming. Introduced in CUDA 13.1, CUDA Tile represents a paradigm shift in GPU programming. Rather…
85dHardware#coding#gpuby Jie Xin
87d ago
Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare
NVIDIA Run:ai v2.24 introduces time-based fairshare, a new scheduling mode that brings fair-share scheduling with time awareness for over-quota resources to Kubernetes clusters. This capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing challenge in shared GPU infrastructure. Consider two teams with equal priority sharing a cluster. Team A continuously submits smaller jobs, while Team B needs to run a larger job that requires more resources. Every time resources free up, the smaller jobs from Team A fit immediately and get scheduled. The larger job from Team B continues to wait for enough resources to become available. Before that happens, the next small job from Team A claims the freed capacity. The result: although both teams have identical priority and entitlements, Team A runs job after job while the job from Team B sits…
87dHardware#gpuby Ekin Karabulut
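The starvation pattern described above is easy to reproduce with a toy greedy scheduler: if one GPU frees up per tick and Team A always has a 1-GPU job waiting, free capacity never accumulates to the 4 GPUs Team B's job needs (an illustrative sketch, not KAI Scheduler logic):

```python
free = 0
started = []
for tick in range(8):
    free += 1                # one small job finishes, freeing a GPU
    if free >= 1:            # greedy: Team A's next 1-GPU job fits now
        started.append("A")
        free -= 1
    if free >= 4:            # Team B's 4-GPU job never sees 4 free GPUs
        started.append("B")
        free -= 4
print(started)
```

Per the post, time-based fairshare breaks this cycle by accounting for each team's over-quota usage over time, so Team A's accumulated consumption eventually lowers its priority and freed capacity can go to Team B's larger job.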
88d ago
Accelerating Diffusion Models with an Open, Plug-and-Play Offering
Recent advances in large-scale diffusion models have revolutionized generative AI across multiple domains, from image synthesis to audio generation, 3D asset creation, molecular design, and beyond. These models have demonstrated unprecedented capabilities in producing high-quality, diverse outputs across various conditional generation tasks. Despite these successes, sampling inefficiency remains a fundamental bottleneck. Standard diffusion models require tens to hundreds of iterative denoising steps, leading to high inference latency and substantial computational cost. This limits practical deployment in interactive applications, edge devices, and large-scale production systems. Video generation faces an especially critical challenge. Open source models such as NVIDIA Cosmos—along with commercial text-to-video (T2V) systems—have shown remarkable text-to-video capabilities. However, video diffusion models are orders of magnitude more computationally demanding due to the temporal dimension. Generating a single video can take minutes to hours, making real-time video generation, interactive editing, and…
89d ago
Adaptive Inference in NVIDIA TensorRT for RTX Enables Automatic Optimization
Deploying AI applications across diverse consumer hardware has traditionally forced a trade-off. You can optimize for specific GPU configurations and achieve peak performance at the cost of portability. Alternatively, you can build generic, portable engines and leave performance on the table. Bridging this gap often requires manual tuning, multiple build targets, or accepting compromises. NVIDIA TensorRT for RTX seeks to eliminate this trade-off. At under 200 MB, this lean inference library provides a Just-In-Time (JIT) optimizer that compiles engines in under 30 seconds. This makes it ideal for real-time, responsive AI applications on consumer-grade devices. TensorRT for RTX introduces adaptive inference—engines that optimize automatically at runtime for your specific system, progressively improving compilation and inference performance as your application runs. No manual tuning, no multiple build targets, no intervention required. Build a lightweight, portable engine once, deploy it anywhere, and…
89d · Hardware · #inference #gpu · by George Stefanakis
93d ago
Scaling NVFP4 Inference for FLUX.2 on NVIDIA Blackwell Data Center GPUs
In 2025, NVIDIA partnered with Black Forest Labs (BFL) to optimize the FLUX.1 text-to-image model series, unlocking FP4 image generation performance on NVIDIA Blackwell GeForce RTX 50 Series GPUs. As a natural extension of the latent diffusion model, FLUX.1 Kontext [dev] proved that in-context learning is a feasible technique for visual-generation models, not just large language models (LLMs). To make this experience more widely accessible, NVIDIA collaborated with BFL to enable a near real-time editing experience using low-precision quantization. FLUX.2 is a significant leap forward, offering the public multi-image references and quality comparable to the best enterprise models. However, because FLUX.2 [dev] requires substantial compute resources, BFL, Comfy, and NVIDIA collaborated to achieve a major breakthrough: reducing the FLUX.2 [dev] memory requirement by more than 40% and enabling local deployment through ComfyUI. This optimization, using FP8 precision, has made FLUX.2…
93d · Hardware · #inference #multimodal #gpu · by Sandro Cavallari
94d ago
Streamlining CUB with a Single-Call API
The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code. This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance. What is CUB? If you need to run a standard algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a principal component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance. While libraries like Thrust provide a high-level, “host-side” interface similar to the C++…
94d · Hardware · by Giannis Gonidelis
101d ago
NVIDIA DLSS 4.5 Delivers Super Resolution Upgrades and New Dynamic Multi Frame Generation
NVIDIA DLSS 4 with Multi Frame Generation has become the fastest-adopted NVIDIA gaming technology ever. Over 250 games and apps use it to make real-time path tracing possible—and upcoming titles for 2026, including PRAGMATA and Resident Evil Requiem, also plan to incorporate the software. At CES 2026, the technology became even more powerful. NVIDIA introduced DLSS 4.5 with a second-generation transformer model for super resolution, and a 6x mode for Multi Frame Generation and Dynamic Multi Frame Generation that automatically shifts the frame generation multiplier in real time to maximize smoothness across games and scenes. Today, developers can begin using the second-generation transformer model for DLSS Super Resolution to provide superior image quality. A more powerful DLSS Super Resolution model DLSS 4 introduced a transformer model architecture with NVIDIA GeForce RTX 50 Series GPUs. That enabled a leap in image…
101d · Hardware · #rag #observability #coding #gpu · by Ike Nnoli
107d ago
Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell
As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users—from consumers to enterprises—to interact with AI more frequently, meaning that more tokens need to be generated. To serve these tokens at the lowest possible cost, AI platforms need to deliver the best possible token throughput per watt. Through extreme co-design across GPUs, CPUs, networking, software, power delivery, and cooling, NVIDIA continues to drive up token throughput per watt, which reduces cost per million tokens. Additionally, NVIDIA continues to enhance its software stacks to achieve even greater levels of performance from existing platforms. This increases the value of the large installed base of NVIDIA GPUs across cloud service providers (CSPs), GPU clouds, model builders, enterprises, and others, enabling that infrastructure to remain productive for longer. In this post, we show…
107d · Hardware · #inference #gpu · by Ashraf Eassa
107d ago
Building Generalist Humanoid Capabilities with NVIDIA Isaac GR00T N1.6 Using a Sim-to-Real Workflow
To make humanoid robots useful, they need cognition and loco-manipulation that span perception, planning, and whole-body control in dynamic environments. Building these generalist robots requires a workflow that unifies simulation, control, and learning for robots to acquire complex skills before transferring into the real world. In this post, we present NVIDIA Isaac GR00T N1.6 and describe a sim-to-real workflow that combines whole-body reinforcement learning (RL) in NVIDIA Isaac Lab, synthetic data–trained navigation with COMPASS, and vision-based localization using NVIDIA CUDA-accelerated visual mapping and simultaneous localization and mapping (SLAM). These components enable loco-manipulation, robust navigation, and environment-aware behavior across diverse robot embodiments. Vision-language-action and reasoning GR00T N1.6 is a multimodal vision-language-action (VLA) model that integrates visual observations from egocentric camera streams, robot states, and natural language instructions into a unified policy representation. The model uses world models, such as NVIDIA Cosmos…
107d · Hardware · #agents #multimodal #gpu · by Edith Llontop
[OLL] Ollama Blog · 3 articles
26d ago
Ollama is now powered by MLX on Apple Silicon in preview
Ollama is now powered by MLX on Apple Silicon in preview March 30, 2026 Today, we’re previewing the fastest way to run Ollama on Apple silicon, powered by MLX, Apple’s machine learning framework. This unlocks new performance to accelerate your most demanding work on macOS: - Personal assistants like OpenClaw - Coding agents like Claude Code, OpenCode, or Codex Fastest performance on Apple silicon, powered by MLX: Ollama on Apple silicon is now built on top of Apple’s machine learning framework, MLX, to take advantage of its unified memory architecture. This results in a large speedup of Ollama on all Apple Silicon devices. On Apple’s M5, M5 Pro, and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time to first token (TTFT)…
26d · Hardware · #llama
214d ago
New model scheduling September 23, 2025 Ollama now includes a significantly improved model scheduling system, reducing crashes due to out of memory issues, maximizing GPU utilization and performance, especially on multi-GPU systems.
New model scheduling September 23, 2025 Ollama now includes a significantly improved model scheduling system. Ahead of running a model, Ollama’s new engine will now measure the exact amount of memory required compared to an estimation in previous versions of Ollama. This has several benefits: - Significantly reduced crashes due to out of memory issues: Because memory management is exact, over-allocations no longer occur, meaning fewer out of memory issues. - Maximizing GPU utilization: Ollama’s new memory management allocates more memory to the GPU, increasing token generation and processing speeds - Multi-GPU performance: Ollama will now schedule models more efficiently over multiple GPUs, significantly improving multi-GPU and mismatched GPU performance - Accurate reporting: Measurements in tools like nvidia-smi will now match ollama ps, making it easy to track memory utilization on your system All models implemented in Ollama’s new engine now…
214d · Hardware · #llama
218d ago
Cloud models
Cloud models September 19, 2025 Cloud models are now in preview, letting you run larger models with fast, datacenter-grade hardware. You can keep using your local tools while running larger models that wouldn’t fit on a personal computer. Ollama’s cloud does not retain your data to ensure privacy and security. The same Ollama experience is now seamless across both local and in the cloud, integrating with the existing tools you already use. Ollama’s cloud models also work via Ollama’s OpenAI-compatible API. Get started Download Ollama v0.12, then open a terminal and run a cloud model: ollama run qwen3-coder:480b-cloud Available models qwen3-coder:480b-cloud gpt-oss:120b-cloud gpt-oss:20b-cloud deepseek-v3.1:671b-cloud Usage Cloud models behave like regular models. For example, you can ls , run , pull , and cp them as needed: % ollama ls NAME ID SIZE MODIFIED gpt-oss:120b-cloud 569662207105 - 5 seconds ago gpt-oss:20b-cloud…
218d · Hardware · #local
[OAI] OpenAI Blog · 16 articles
2d ago
Top 10 uses for Codex at work
Top 10 uses for Codex at work Try these 10 prompts to move real work forward with dashboards, decks, workflows, and more. You’ve seen what Codex can do. Now it’s time to put it to work. These use cases show how to use Codex to do real work: create deliverables, pull together context from multiple tools, take action on real inputs, and move tasks forward faster. Start with the generic prompt if you want something you can use right away, then use the customization suggestions and example to make it your own. You start the day by bouncing between your calendar, messages, email, and notes, trying to figure out what matters most. Codex can pull that context together, keep watch for changes, and turn it into one clear brief so you spend less time triaging and more time acting on…
2d · Hardware · #agents
33d ago
Creating with Sora Safely
The Sora 2 model and the Sora app offer state-of-the-art video generation with a new way to create together, and we’ve made sure safety is built in from the very start. Our approach is anchored in concrete protections: - Distinguishing AI content. Every video generated with Sora includes both visible and invisible provenance signals. All Sora videos also embed C2PA metadata—an industry-standard signature—and we maintain internal reverse-image and audio search tools that can trace videos back to Sora with high accuracy, building on successful systems from ChatGPT image generation and Sora 1. Many outputs also carry visible, dynamically moving watermarks which include the name of the creator. - Image-to-video with real person likeness. As we continue to strengthen Sora’s guardrails, we’re enabling more creative expression and connection, including letting people create videos from photos of family and friends. Users…
94d ago
How Higgsfield turns simple ideas into cinematic social videos
Short-form video drives modern commerce, but producing video that actually performs is harder than it looks. Clips that feel effortless on TikTok, Reels, and Shorts are built on invisible rules: hook timing, shot rhythm, camera motion, pacing, and other subtle cues that make content feel “native” to whatever is trending. Higgsfield is a generative media platform that lets teams create short-form, cinematic videos from a product link, an image, or a simple idea. Using OpenAI GPT‑4.1 and GPT‑5 to plan and Sora 2 to create, the system generates roughly 4 million videos per day, turning minimal input into structured, social-first video. “Users rarely describe what a model actually needs. They describe what they want to feel. Our job is to translate that intent into something a video model can execute, using OpenAI models to turn goals…
94d · Hardware · #gpt #multimodal
156d ago
OpenAI and Foxconn collaborate to strengthen U.S. manufacturing across the AI supply chain
OpenAI and Foxconn collaborate to strengthen U.S. manufacturing across the AI supply chain Today we’re announcing a collaboration with Hon Hai Technology Group (Foxconn) focused on design work and U.S. manufacturing readiness for the next generation of AI infrastructure hardware. As part of this work, OpenAI will share insight into emerging hardware needs across the AI industry to help inform Foxconn’s design and development efforts for hardware to be manufactured at Foxconn’s U.S. facilities. While this initial agreement does not include purchase commitments or financial obligations, OpenAI will have early access to evaluate these systems and an option to purchase them. As AI capabilities continue to advance, so has the need for a new class of physical infrastructure that is purpose-built for the demands of advanced models. By combining OpenAI’s insight into the needs of today’s and future models with…
156d · Hardware
194d ago
OpenAI and Broadcom announce strategic collaboration to deploy 10 gigawatts of OpenAI-designed AI accelerators
OpenAI and Broadcom announce strategic collaboration to deploy 10 gigawatts of OpenAI-designed AI accelerators Multi-year partnership enables OpenAI and Broadcom to deliver accelerator and network systems for next-generation AI clusters. News: - OpenAI and Broadcom will co-develop systems that include accelerators and Ethernet solutions from Broadcom for scale-up and scale-out - Broadcom to deploy racks of AI accelerator and network systems targeted to start in the second half of 2026, to complete by end of 2029 San Francisco and Palo Alto—October 13, 2025—OpenAI and Broadcom today announced a collaboration for 10 gigawatts of custom AI accelerators. OpenAI will design the accelerators and systems, which will be developed and deployed in partnership with Broadcom. By designing its own chips and systems, OpenAI can embed what it’s learned from developing frontier models and products directly into the hardware, unlocking new levels of…
194d · Hardware
197d ago
HYGH speeds development and campaigns with ChatGPT Business
HYGH speeds development and campaigns with ChatGPT Business From rapid MVPs to campaign previews, HYGH uses AI to cut turnaround times and deliver more creative options to advertisers. HYGH is a digital media company whose goal is to make outdoor advertising as easy to manage as online ads. Its tech platform connects more than 4,000 digital displays across Germany - from shop window screens to the country’s largest 3D LED billboard - to deliver data-driven ad content at high-impact touchpoints. But behind their growing network of screens, HYGH’s internal development processes were slowing them down. “We wanted to get out of the clunky process where even small internal tools required endless meetings and dependencies,” says HYGH’s co-founder, Antonius Link. Since starting to use ChatGPT Business, HYGH estimates they’re saving 5.5 hours per employee, per week. “Now one person can take…
197d · Hardware · #gpt
201d ago
AMD and OpenAI announce strategic partnership to deploy 6 gigawatts of AMD GPUs
AMD and OpenAI announce strategic partnership to deploy 6 gigawatts of AMD GPUs News - OpenAI to deploy 6 gigawatts of AMD GPUs based on a multi-year, multi-generation agreement - Initial 1 gigawatt OpenAI deployment of AMD Instinct™ MI450 Series GPUs starting in 2H 2026 SANTA CLARA, Calif.—October 6, 2025—AMD (NASDAQ: AMD) and OpenAI today announced a 6 gigawatt agreement to power OpenAI’s next-generation AI infrastructure across multiple generations of AMD Instinct GPUs. The first 1 gigawatt deployment of AMD Instinct MI450 GPUs is set to begin in the second half of 2026. AMD’s strong leadership in high-performance computing systems and OpenAI's pioneering research and advancements in generative AI places the two companies at the forefront of this important and pivotal time for AI. Under this definitive agreement, OpenAI will work with AMD as a core…
201d · Hardware
206d ago
Samsung and SK join OpenAI’s Stargate initiative to advance global AI infrastructure
Samsung and SK join OpenAI’s Stargate initiative to advance global AI infrastructure Samsung, SK, and OpenAI today announced new strategic partnerships as part of OpenAI’s Stargate initiative, the company’s overarching AI infrastructure platform, aimed at expanding infrastructure critical to AI development, globally and in Korea. The announcement followed a meeting between President Lee Jae-myung, Samsung Electronics Executive Chairman Jay Y. Lee, SK Chairman Chey Tae-won, and OpenAI CEO Sam Altman at the Presidential Office in Seoul. These partnerships will focus on increasing the supply of advanced memory chips essential for next-generation AI and expanding data center capacity in Korea, positioning Samsung and SK as key contributors to global AI infrastructure and supporting Korea’s ambition to become a top-three global AI nation. Through these partnerships, Samsung Electronics and SK hynix plan to scale up production of advanced memory chips, targeting 900,000…
206d · Hardware
207d ago
Launching Sora responsibly
Sora 2 and the Sora app combine cutting-edge video generation with a new way to create together, and we’ve made sure safety is built in from the very start. Our approach is anchored in concrete protections: - Distinguishing AI content. Every video generated with Sora includes both visible and invisible provenance signals. At launch, all outputs carry a visible watermark. All Sora videos also embed C2PA metadata—an industry-standard signature—and we maintain internal reverse-image and audio search tools that can trace videos back to Sora with high accuracy, building on successful systems from ChatGPT image generation and Sora 1. - Consent-based likeness using characters. Our goal is to place you in control of your likeness end-to-end with Sora characters. We have guardrails intended to ensure that your audio and image likeness captured in characters are used with your consent. Only…
215d ago
OpenAI and NVIDIA announce strategic partnership to deploy 10 gigawatts of NVIDIA systems
OpenAI and NVIDIA announce strategic partnership to deploy 10 gigawatts of NVIDIA systems News - Strategic partnership enables OpenAI to build and deploy at least 10 gigawatts of AI datacenters with NVIDIA systems representing millions of GPUs for OpenAI’s next-generation AI infrastructure. - To support the partnership, NVIDIA intends to invest up to $100 billion in OpenAI progressively as each gigawatt is deployed. - The first gigawatt of NVIDIA systems will be deployed in the second half of 2026 on NVIDIA’s Vera Rubin platform. San Francisco and Santa Clara—September 22, 2025—NVIDIA and OpenAI today announced a letter of intent for a landmark strategic partnership to deploy at least 10 gigawatts of NVIDIA systems for OpenAI’s next-generation AI infrastructure to train and run its next generation of models on the path to deploying superintelligence. To support this deployment including datacenter and…
215d · Hardware · #gpu
243d ago
Announcing the OpenAI Learning Accelerator
Introducing the OpenAI Learning Accelerator in India Today, OpenAI announced the launch of OpenAI Learning Accelerator, an India-first initiative that aims to bring advanced AI to India’s educators and millions of learners nationwide through AI research, training, and deployment. ChatGPT is now one of the most widely used learning tools in the world. Nowhere is this more true than in India, which is home to the largest student population on ChatGPT globally, with millions turning to it for homework help, exam prep, and to explore new ideas. The popularity of ChatGPT in learning also presents new challenges: how to ensure AI deepens rather than shortcuts learning, and how to help students build critical thinking skills when answers are instantly available. OpenAI Learning Accelerator is designed to address these challenges and empower educators and learners—to ensure AI strengthens learning, supports teachers,…
243d · Hardware
261d ago
From hard refusals to safe-completions: toward output-centric safety training
From hard refusals to safe-completions: toward output-centric safety training Introduced in GPT‑5, safe-completion is a new safety-training approach to maximize model helpfulness within safety constraints. Compared to refusal-based training, safe-completion improves both safety and helpfulness, especially in dual-use domains. If a user asks ChatGPT for the minimum energy needed to ignite a firework display, should it give a helpful answer? The user could be preparing for a July 4th display or a research project for school … or build explosives. As a result, giving a helpful answer could be harmless or harmful depending on the user’s (apparent) intent. This kind of prompt is dual-use: a question with unclear intent, where information could be used in benign or malicious ways. Dual-use problems are especially prevalent in risk areas such as biology and cybersecurity. In the past, production models such as ChatGPT…
261d · Hardware · #training #safety
263d ago
Introducing gpt-oss
Introducing gpt-oss gpt-oss-120b and gpt-oss-20b push the frontier of open-weight reasoning models We’re releasing gpt-oss-120b and gpt-oss-20b—two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool use capabilities, and are optimized for efficient deployment on consumer hardware. They were trained using a mix of reinforcement learning and techniques informed by OpenAI’s most advanced internal models, including o3 and other frontier systems. The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid…
263d · Hardware
396d ago
Addendum to GPT-4o System Card: 4o image generation
Addendum to GPT‑4o System Card: 4o image generation 4o image generation is a new, significantly more capable image generation approach than our earlier DALL·E 3 series of models. It can create photorealistic output. It can take images as inputs and transform them. It can follow detailed instructions, including reliably incorporating text into images. And because it is embedded natively, deep in the architecture of our omnimodal GPT‑4o model, 4o image generation can use everything it knows to apply these capabilities in subtle and expressive ways, creating images that are not only beautiful, but also useful. 4o image generation benefits from our existing safety infrastructure, and from lessons we have learned deploying DALL·E and Sora. At the same time, these new capabilities also bring some new risks. This addendum to the GPT‑4o system card describes the marginal risks we’ve focused on,…
396d · Hardware · #gpt #multimodal
449d ago
OpenAI o3-mini
We’re releasing OpenAI o3‑mini, the newest, most cost-efficient model in our reasoning series, available in both ChatGPT and the API today. Previewed in December 2024, this powerful and fast model advances the boundaries of what small models can achieve, delivering exceptional STEM capabilities—with particular strength in science, math, and coding—all while maintaining the low cost and reduced latency of OpenAI o1‑mini. OpenAI o3‑mini is our first small reasoning model that supports highly requested developer features including function calling, Structured Outputs, and developer messages, making it production-ready out of the gate. Like OpenAI o1‑mini and OpenAI o1‑preview, o3‑mini will support streaming. Also, developers can choose between three reasoning effort options—low, medium, and high—to optimize for their specific use cases. This flexibility…
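The reasoning-effort option described above is set per request. A minimal sketch of such a request body, built locally without sending it (actually calling the API needs an OpenAI API key and network access; the arithmetic prompt here is just a placeholder):

```python
import json

# Sketch of a Chat Completions request for o3-mini with the
# reasoning-effort option described above. We only construct the
# payload here; sending it requires credentials and network access.
payload = {
    "model": "o3-mini",
    "reasoning_effort": "high",  # one of "low", "medium", "high"
    "messages": [
        # o-series models accept "developer" messages in place of "system"
        {"role": "developer", "content": "Answer with just the number."},
        {"role": "user", "content": "What is 17 * 24?"},
    ],
}

body = json.dumps(payload)
print(body)
```

Dropping `reasoning_effort` to `"low"` trades answer quality on hard problems for lower latency and cost, which is the flexibility the post highlights.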
449d · Hardware · #gpt #coding
467d ago
OpenAI’s Economic Blueprint
OpenAI’s Economic Blueprint The Blueprint outlines policy proposals for how the US can maximize AI’s benefits, bolster national security, and drive economic growth. Today, OpenAI is releasing a new Economic Blueprint that lays out our policy proposals for extending America’s global leadership in AI innovation, ensuring equitable access to AI, and driving economic growth across communities nationwide. As AI becomes more advanced, we believe America needs to act now to maximize the technology’s possibilities while minimizing its harms. AI is too powerful to be led and shaped by autocrats, but that is the growing risk we face, while the economic opportunity AI presents is too compelling to forfeit. Shared prosperity is as near and measurable as the new jobs and growth that will come from building more AI infrastructure like data centers, chip manufacturing facilities, and…
467d · Hardware
[PB] PyTorch Blog · 3 articles
17d ago
Faster Diffusion on Blackwell: MXFP8 and NVFP4 with Diffusers and TorchAO
Diffusion models for image and video generation have been surging in popularity, delivering super-realistic visual media. However, their adoption is often constrained by the sheer requirements in memory and compute. Quantization is essential for efficient serving of these models. In this post, we demonstrate reproducible end-to-end inference speedups of up to 1.26x with MXFP8 and 1.68x with NVFP4 with diffusers and torchao on the Flux.1-Dev, QwenImage, and LTX-2 models on NVIDIA B200. We also outline how we used selective quantization, CUDA Graphs, and LPIPS as a measure to iterate on the accuracy and optimal performance of these models. The code to reproduce the experiments in this post is here. Table of contents: - Background on MXFP8 and NVFP4 - Basic Usage with Diffusers and TorchAO - Benchmark Results - Technical Considerations Background on MXFP8 and NVFP4 MXFP8 and NVFP4 are…
17d · Hardware · #multimodal #gpu · by Vasiliy Kuznetsov (Meta) and Sayak Paul (Hugging Face)
31d ago
Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan
TL;DR In a joint effort between PyTorch and Nebius, we enabled training DeepSeek-V3 Mixture-of-Experts models (16B and 671B) on a 256-GPU NVIDIA B200 cluster using TorchTitan. We evaluated two orthogonal optimizations on top of a BF16 baseline: MXFP8 training (via TorchAO) and DeepEP communication acceleration (via DeepEP). The highlights: - DeepSeek-V3 671B: DeepEP alone yields 859 token/sec (+32%) over the BF16 baseline (651 token/sec). Adding MXFP8 on grouped GEMMs and combining that with DeepEP pushes the performance to 918 token/sec, a +41% total throughput gain. - DeepSeek-V3 16B MoE: Loss convergence experiments over 1,500 steps confirm that MXFP8 training is equivalent to BF16 (No degradation in convergence behavior). All experiments ran on Nebius Cloud using open-source PyTorch-native tooling and are fully reproducible. Please refer to the last section (Reproducibility), to get access to all recipes. Why This Experiment Training frontier-scale…
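The percentage gains quoted above follow directly from the token/sec figures in the post, and can be checked with a couple of lines:

```python
# Checking the throughput gains quoted above (tokens/sec from the post).
baseline = 651   # BF16 baseline, DeepSeek-V3 671B
deepep   = 859   # + DeepEP communication acceleration
combined = 918   # + DeepEP + MXFP8 on grouped GEMMs

gain_deepep   = (deepep / baseline - 1) * 100
gain_combined = (combined / baseline - 1) * 100

print(f"DeepEP alone: +{gain_deepep:.0f}%")      # +32%
print(f"DeepEP + MXFP8: +{gain_combined:.0f}%")  # +41%
```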
31d · Hardware · #training #gpu · by PyTorch and Nebius (Hooman Ramezani) Teams
36d ago
PyTorch 2.10+TorchAO: Powering AIPC scenarios on Intel® Core™ Ultra Series 3 processors
Overview We are excited to introduce the highlights of Intel® Core™ Ultra Series 3 processors and the advancements we have made in PyTorch to enable users to unlock a wider range of AI scenarios on PC and edge computing. Intel® Core™ Ultra Series 3 processors with Arc B-series GPU The latest Intel® Core™ Ultra Series 3 processors feature a series of improvements to boost AI capabilities and performance of mobile PCs and edge systems, including a larger integrated GPU: - New Xe3 architecture - Up to 12 Xe-cores GPU configuration - Up to 96 XMX AI engines offering up to 120 TOPs - Up to 96GB of fast LPDDR5x-9600 The combination of dense matrix multiplication capabilities in the GPU with access to full system memory bandwidth gives Intel® Core™ Ultra Series 3 processors unique capabilities in the segment to run larger…
36d · Hardware · by Intel PyTorch and Client AI SW team
[RB] Replicate Blog · 2 articles
292d ago
Compare AI video models
Compare AI video models Posted July 7, 2025. Last updated: August 11, 2025. It’s hard keeping up with every new video model. In this post we’ll help you pick the best one for your needs. We’ll break this down into two parts: - key model specs like price, resolution, duration, fps, speed, and date of release - features like text-to-video, image-to-video, subject references, and native audio Every video model is available for commercial use on Replicate. Specs Where a price range is given, it’s from the lowest-priced to the highest-priced video (based on duration and resolution). Generation speed is also a range from the fastest to the slowest. Times and prices are correct as of July 7, 2025. Video generation speed can improve over time, as the model is optimized or switched to better hardware.
292d · Hardware · #multimodal
344d ago
NVIDIA H100 GPUs are here
NVIDIA H100 GPUs are here You can now run NVIDIA H100 GPUs on Replicate. You can also now use 2x, 4x, and 8x configurations of A100s and L40S GPUs. These were previously only available in deployments, but now you can use them for regular models and training runs. If you’ve been waiting to speed up your model or try something more powerful, now’s a good time. H100 pricing 1x H100s are now available to everyone. 2x, 4x, and 8x H100s are currently reserved for committed spend contracts. Email us at team@replicate.com if you want access. A100 pricing (2x, 4x, 8x) These multi-GPU setups for A100s are now available for models (they were already available for deployments): See the full hardware pricing list for more details. L40S pricing (2x, 4x, 8x) These multi-GPU setups for L40S GPUs are now available for…
344d · Hardware · #gpu
[SWB] Simon Willison Blog · 3 articles
9d ago
llm-anthropic 0.25
16th April 2026 - New model: claude-opus-4.7, which supports thinking_effort: xhigh. #66 - New thinking_display and thinking_adaptive boolean options. thinking_display summarized output is currently only available in JSON output or JSON logs. - Increased default max_tokens to the maximum allowed for each model. - No longer uses obsolete structured-outputs-2025-11-13 beta header for older models. Recent articles - DeepSeek V4 - almost on the frontier, a fraction of the price - 24th April 2026 - Extract PDF text in your browser with LiteParse for the web - 23rd April 2026 - A pelican for GPT-5.5 via the semi-official Codex backdoor API - 23rd April 2026
10d ago
datasette-ports 0.3
15th April 2026 A small update for my tool for helping me figure out what all of the Datasette instances on my laptop are up to. - Show working directory derived from each PID - Show the full path to each database file Output now looks like this: http://127.0.0.1:8007/ - v1.0a26 Directory: /Users/simon/dev/blog Databases: simonwillisonblog: /Users/simon/dev/blog/simonwillisonblog.db Plugins: datasette-llm datasette-secrets http://127.0.0.1:8001/ - v1.0a26 Directory: /Users/simon/dev/creatures Databases: creatures: /tmp/creatures.db
10dHardware
13d ago
Gemma 4 audio with MLX
12th April 2026
Thanks to a tip from Rahim Nathwani, here's a uv run recipe for transcribing an audio file on macOS using the 10.28 GB Gemma 4 E2B model with MLX and mlx-vlm:
uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0
I tried it on this 14 second .wav file and it output the following:
This front here is a quick voice memo. I want to try it out with MLX VLM. Just going to see if it can be transcribed by Gemma and how that works.
(That was supposed to be "This right here..." and "... how well that works" but I can hear why it misinterpreted that as "front" and "how that works".)
13dHardware#multimodal
[TVA]The Verge AI· 4 articlesvisit →
3d ago
AI failure could trigger the next financial crisis, warns Elizabeth Warren
“I know a bubble when I see one.”
‘The first big stumble will have everyone running for the exits.’ That’s what Sen. Elizabeth Warren (D-MA), who led the push to create a new consumer financial regulator in the wake of the 2008 recession, told a crowd at a Vanderbilt Policy Accelerator event in Washington, DC, on Wednesday. Warren warned of what she called “striking” parallels to that crisis in the AI industry. While she believes the technology has “enormous potential,” she argued that AI companies’ massive spending and borrowing practices are creating a tinderbox and that Congress should step in. Though the AI industry has grown rapidly, Warren said the pace isn’t keeping up with the companies’ spending, requiring them…
3dHardwareby Lauren Feiner
4d ago
Framework’s first eGPUs turn its laptop into a desktop PC
Remember when Framework made the first laptop where you can easily upgrade its entire internal video card in three minutes flat? The company’s getting into the external graphics game, too. As promised last August, you’ll be able to turn the Framework Laptop 16’s GPU modules into external ones instead. Or, you can plug in a desktop graphics card (or network card, or other PCIe cards) for more power than most laptops ever dream of having, with eight lanes of PCI-Express bandwidth. Only power users need apply — and you’ll have to shut down the laptop before you plug or unplug. Framework’s calling it the OCuLink Dev Kit, because it uses the…
4dHardware#multimodalby Sean Hollister
4d ago
John Ternus’ first big problem is AI
Less than a year ago, Apple made headlines for a lack of AI announcements at its annual WWDC event. Ten months later, the company has announced that hardware executive John Ternus will succeed longtime CEO Tim Cook as chief executive — and the official release doesn’t mention AI once. Does Tim Cook’s newly announced successor have what it takes to regain the company’s lost ground in the AI race? Ternus, currently Apple’s SVP of hardware engineering, will take over as CEO on September 1st, after Cook’s decade and a half in the role. Ternus is a 25-year veteran of the company and the first Apple CEO…
4dHardwareby Hayden Field
5d ago
Silicon Valley has forgotten what normal people want
One of the most mortifying things about knowing a lot of techies is listening to them tell me excitedly about some very important discovery that they believe they have made. Recently, I ran into an acquaintance of mine, who began talking my ear off about an amazing discovery he’d made with LLMs. Knowledge, it turns out, is structured into language! You could put one word into ChatGPT and it might understand what you wanted, or make up a word and see if it understood what you meant! These amazing new tools have revealed that the English corpus contains so much about its speakers! What NFTs, AI and the metaverse tell us about “thought leadership”…
5dHardwareby Elizabeth Lopatto
[WA]Wired AI· 2 articlesvisit →
5d ago
Prego Has a Dinner-Conversation-Recording Device, Capisce?
Prego, the pasta sauce company, is getting into hardware with a device that sits on your table and records dinner conversations. No, this isn’t April Fools’. The Connection Keeper is a round puck that houses two microphones for recording around the table. The recorder was developed in partnership with StoryCorps, the 23-year-old nonprofit that has recorded conversations with more than 720,000 people about their lives. The Connection Keeper is more of a publicity stunt than a readily available product. Fewer than 100 will be made. The pucks look more like tuna cans than anything you’d associate with the pasta sauce brand—small and meant to be tucked aside so as not to attract attention. The whole goal here, Prego and StoryCorps say, is to advocate for keeping people off their phones during dinner. “Everything now is AI, and everyone has…
5dHardwareby Boone Ashworth
7d ago
Schematik Is ‘Cursor for Hardware.’ Anthropic Wants In
Samuel Beek knew he had a problem when he blew every fuse in his house. The culprit was an electric door opener he had built himself, guided by instructions for wiring and piecing together a device drummed up by ChatGPT. Turns out, the chatbot wasn’t so great at distinguishing between wet and dry connections, so the device he had built sent out a surge of misallocated power that zapped everything else. Oops. Beek, based in Amsterdam, admits he is not a hardware guy. But he had that itch and now really just wanted to make something that wouldn’t explode. “That's the difference: Your fuses blow out, or you have a solid product,” Beek says. “That was kind of a learning experience for me to be more careful, but also to build AI that deeply understands what it's talking about.” He…
7dHardware#codingby Boone Ashworth