$ timeahead_
★ TOP STORY · [OAI] · Research · 2d ago

Introducing GPT-5.5

Update on April 24, 2026: GPT‑5.5 and GPT‑5.5 Pro are now available in the API. The system card has also been updated to describe the additional safeguards that apply. We’re releasing GPT‑5.5, our smartest and most intuitive to use model yet, and the next step toward a new way of getting work done on a computer. GPT‑5.5 understands what you’re trying to do faster and can carry more of the work itself. It excels at writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished. Instead of carefully managing every step, you can give GPT‑5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate through ambiguity, and keep going. The gains are especially strong in agentic coding, computer use, knowledge work,…

OpenAI Blog
▲ trending · last 48h
[ANT] Anthropic News · 5 articles
12d ago
Apr 14, 2026 · Announcements · Anthropic’s Long-Term Benefit Trust appoints Vas Narasimhan to Board of Directors
Vas Narasimhan has been appointed to Anthropic's Board of Directors by the Anthropic Long-Term Benefit Trust. He is a physician-scientist and the Chief Executive Officer of Novartis—one of the world's leading innovative medicines companies—and shares Anthropic’s conviction that healthcare and life sciences are among the areas where AI has the greatest potential to improve the quality of human life. “Vas brings something rare to our board. He's overseen the development and approval of more than 35 novel medicines for the benefit of patients around the world in one of the most regulated industries,” said Daniela Amodei, Co-founder and President of Anthropic. “Getting powerful new technology to people safely and at scale is what we think about every day at Anthropic. Vas has been doing exactly that for years, and…
19d ago
Apr 6, 2026 · Announcements · Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute
We have signed a new agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity that we expect to come online starting in 2027. This significant expansion of our compute infrastructure will power our frontier Claude models and help us serve extraordinary demand from customers worldwide. “This groundbreaking partnership with Google and Broadcom is a continuation of our disciplined approach to scaling infrastructure: we are building the capacity necessary to serve the exponential growth we have seen in our customer base while also enabling Claude to define the frontier of AI development,” said Krishna Rao, CFO of Anthropic. “We are making our most significant compute commitment to date to keep pace with our unprecedented growth.” Demand from Claude customers has accelerated in 2026. Our run-rate…
25d ago
Mar 31, 2026 · Announcements · Australian government and Anthropic sign MOU for AI safety and research
Today, Anthropic signed a Memorandum of Understanding with the Australian government to cooperate on AI safety research and support the goals of Australia’s National AI Plan. Our CEO, Dario Amodei, met with Prime Minister Anthony Albanese to formalize the agreement during a visit to Canberra, Australia. We also announced AUD$3 million in partnerships with leading Australian research institutions to use Claude to improve disease diagnosis and treatment and support computer science education and research. Central to the MOU is a commitment to work with Australia’s AI Safety Institute. We will share our findings on emerging model capabilities and risks, participate in joint safety and security evaluations, and collaborate on research with Australian academic institutions. This mirrors the arrangements we have with safety institutes in the US, UK, and Japan,…
46d ago
Mar 10, 2026 · Announcements · Sydney will become Anthropic’s fourth office in Asia-Pacific
Anthropic is expanding to Australia and New Zealand. In the coming weeks, we will open an office in Sydney—our fourth office in Asia-Pacific, alongside Tokyo, Bengaluru, and Seoul. The expansion reflects strong demand from businesses in Australia and New Zealand and will help us better serve the countries’ unique AI ecosystems. In addition to hiring a team in Sydney, we plan to deepen our engagement with Australian institutions, as well as collaborate on projects that advance Australia’s national interests and priority sectors. Our executive team will visit Australia at the end of March to formalize some of these partnerships and meet with customers and policymakers. “We’re excited by the ways organizations in Australia and New Zealand are applying AI to areas of national importance—financial services, agricultural technology, clean energy innovation, healthcare delivery,…
50d ago
Mar 6, 2026 · Policy · Partnering with Mozilla to improve Firefox’s security
AI models can now independently identify high-severity vulnerabilities in complex software. As we recently documented, Claude found more than 500 zero-day vulnerabilities (security flaws that are unknown to the software’s maintainers) in well-tested open-source software. In this post, we share details of a collaboration with researchers at Mozilla in which Claude Opus 4.6 discovered 22 vulnerabilities over the course of two weeks. Of these, Mozilla assigned 14 as high-severity vulnerabilities—almost a fifth of all high-severity Firefox vulnerabilities that were remediated in 2025. In other words: AI is making it possible to detect severe security vulnerabilities at highly accelerated speeds. As part of this collaboration, Mozilla fielded a large number of reports from us, helped us understand what types of findings warranted submitting a bug…
[ATA] Ars Technica AI · 3 articles
3d ago
Indian med student rakes in thousands with AI-generated MAGA hottie
Like many medical school students, Sam was broke. The 22-year-old aspiring orthopedic surgeon from northern India got some money from his parents, but he says he spent most of it subsidizing his licensing exams, and he’s still saving up to hopefully emigrate to the US after graduation. So he started searching for ways to make additional money online. Sam, who requested a pseudonym to avoid jeopardizing his medical career and immigration status, tried a few things, with varying degrees of legitimacy and success. He made YouTube shorts and sold study notes to other med students. It wasn’t until he started scrolling through his Instagram feed that he landed on an idea: Why not make an AI-generated girl using Google Gemini’s Nano Banana Pro and sell bikini photos of her online? But when Sam started posting generic photos of a beautiful,…
by Ej Dickson, wired.com
4d ago
Mozilla: Anthropic's Mythos found 271 security vulnerabilities in Firefox 150
Earlier this month, Anthropic said its Mythos Preview model was so good at finding cybersecurity vulnerabilities that the company was limiting its initial release to “a limited group of critical industry partners.” Since then, debate has raged over whether the model presages an era of turbocharged AI-aided hacking or if Anthropic is just building hype for what is a relatively normal step up on the ladder of advancing AI capabilities. Mozilla added some important data to that debate Tuesday, writing in a blog post that early access to Mythos Preview had helped it pre-identify 271 security vulnerabilities in this week’s release of Firefox 150. The results were significant enough to get Firefox CTO Bobby Holley to enthuse that, in the never-ending battle between cyberattackers and cyberdefenders, “defenders finally have a chance to win, decisively.” “We’ve rounded the curve” Holley didn’t…
by Kyle Orland
8d ago
Satellite and drone images reveal big delays in US data center construction
Silicon Valley has been pouring hundreds of billions of dollars into building ever-larger AI data centers that require as much electricity as hundreds of thousands of US homes—but that massive buildout faces significant construction and power challenges along with growing local resistance. Now satellite imagery is showing that nearly 40 percent of US data center projects may fail to be completed this year as scheduled. The Financial Times drew upon satellite imagery from the geospatial data analytics company SynMax showing how much progress has been made in clearing land and laying building foundations for each data center project. It also cross-checked project progress against public statements and permit documents compiled by the industry research group IIR Energy. The resulting analysis revealed how major projects from tech companies such as Microsoft, Oracle, and OpenAI are “likely to miss completion dates by…
by Jeremy Hsu
[CB] Cerebras Blog · 2 articles
3d ago
OpenAI GPT-OSS 120B Benchmarked – NVIDIA Blackwell vs. Cerebras · November 06, 2025
Nvidia Blackwell is an upgrade over Hopper, raising top GPU inference speed by 2-3x and leapfrogging small-chip AI competitors. Cerebras outperforms Nvidia…
3d ago
Case Study - Cognition x Cerebras · December 10, 2025
Powered by Cerebras Inference, Cognition’s SWE-1.5 and the SWE-grep family deliver frontier-level coding performance up to 13x faster than general-purpose models—keeping developers in flow while they explore codebases, ship features, and debug complex systems
[FAB] Fireworks AI Blog · 1 article
54d ago
2/3/2026 · The Benchmark Gap: What It Takes to Ship Kimi K2.5
Kimi K2.5 is live on Fireworks at ~1/10 the cost and 2-3x the speed of closed frontier models. As the fastest open-source provider of Kimi K2.5, Fireworks is seeing unprecedented model adoption. Kimi K2.5 is a landmark release for open models with benchmark results on par with top closed models and unprecedented visual coding quality. But enabling full quality in production requires more than just hosting the model. Here's how Fireworks ensures that developers get the best quality on our platform and how that translates into specific edge cases. How We Approach Quality at Fireworks Deploying frontier open models has taught us that quality emerges or degrades in the gaps: between the model and serving stack, between the chat template on Hugging Face and what’s running in the first-party API.…
[GDM] Google DeepMind Blog · 8 articles
24d ago
The latest AI news we announced in March 2026
For more than 20 years, we’ve invested in machine learning and AI research, tools and infrastructure to build products that make everyday life better for more people. Teams across Google are working on ways to unlock AI’s benefits in fields as wide-ranging as healthcare, crisis response and education. To keep you posted on our progress, we're doing a regular roundup of Google's most recent AI news. Here’s a look back at some of our AI announcements from March. This March, we focused on making AI feel even more helpful to your day-to-day world. We introduced updates to help Gemini understand your specific context — from your travel plans and work projects to your shopping preferences — giving you the option to turn your devices into proactive helpers. Whether you’re vibe coding…
by The Keyword Team
39d ago
Measuring progress toward AGI: A cognitive framework
Artificial General Intelligence (AGI) has the potential to accelerate scientific discovery and help solve some of humanity’s most pressing problems. But it can be difficult to know how close we are to this key milestone, because there’s a lack of empirical tools for evaluating systems’ general intelligence. Tracking progress toward AGI will require a wide range of methods and approaches, and we believe cognitive science provides one important piece of the puzzle. That’s why today, we’re releasing a new paper, “Measuring Progress Toward AGI: A Cognitive Taxonomy,” that presents a scientific foundation for understanding the cognitive capabilities of AI systems. Alongside the paper, we are partnering with Kaggle to launch a hackathon, inviting the research community to help build the evaluations needed to put this framework into practice. Deconstructing general intelligence Our framework…
by Oran Kelly
53d ago
Create new worlds in Project Genie with these 4 tips
We recently introduced Project Genie, an experimental research prototype that lets you create, explore and remix your own interactive worlds. With Project Genie, you can develop worlds with characters and environments, then navigate them in real time, like by journeying to a new, imaginary planet or diving underwater with sea creatures. Project Genie is currently available to Google AI Ultra Subscribers in the U.S. over 18, with plans to expand further. You can prompt Project Genie with just text, or with text and images. If you’re ready to bring your imaginary world to life, here are some tips on how to prompt Project Genie as well as features to try. 1. Describe the environment in detail Start by writing out what kind of environment you want — for example, you…
by Molly McHugh-Johnson
59d ago
Ask a Techspert: What’s a world model?
We recently introduced Project Genie, an experimental research prototype that lets you create, explore and remix your own interactive worlds. Project Genie is powered by what’s called a “world model.” It’s currently available to Google AI Ultra subscribers in the U.S. over 18 with plans to expand further. Now, you’ve probably heard of large language models, machine learning models, image generation models and so on…but “world model” might be a new one. To help explain the concept, we sat down with Googlers Shlomi Fruchter and Jack Parker-Holder. Congratulations on the launch of Project Genie! What were your roles on the team? Shlomi: Jack and I co-lead Genie development. I mostly focus on our next-generation video and world models and working with the team to research new improvements. Jack: I'm a research scientist as…
by Molly McHugh-Johnson
72d ago
Gemini 3 Deep Think: Advancing science, research and engineering
Today, we’re releasing a major upgrade to Gemini 3 Deep Think, our specialized reasoning mode, built to push the frontier of intelligence and solve modern challenges across science, research, and engineering. We updated Gemini 3 Deep Think in close partnership with scientists and researchers to tackle tough research challenges — where problems often lack clear guardrails or a single correct solution and data is often messy or incomplete. By blending deep scientific knowledge with everyday engineering utility, Deep Think moves beyond abstract theory to drive practical applications. The new Deep Think is now available in the Gemini app for Google AI Ultra subscribers and, for the first time, we’re also making Deep Think available via the Gemini API to select researchers, engineers and enterprises. Express interest in early access here. Here…
by The Deep Think team
80d ago
The latest AI news we announced in January
For more than 20 years, we’ve invested in machine learning and AI research, tools and infrastructure to build products that make everyday life better for more people. Teams across Google are working on ways to unlock AI’s benefits in fields as wide-ranging as healthcare, crisis response and education. To keep you posted on our progress, we're doing a regular roundup of Google's most recent AI news. Here’s a look back at some of our AI announcements from January. In January, we moved AI toward a new era of Personal Intelligence: making products like Search, Chrome and the Gemini app more proactive than ever. Whether it’s Chrome’s “auto browse” handling your complex chores or Gmail surfacing what matters most, these new personalization features are focused on anticipating your needs, understanding your context and…
by Keyword Team
82d ago
Advancing AI benchmarking with Game Arena
Chess is a game of perfect information. The real world is not. Last year, Google DeepMind partnered with Kaggle to launch Game Arena, an independent, public benchmarking platform where AI models compete in strategic games. We started with chess to measure reasoning and strategic planning. But in the real world, decisions are rarely based on complete information. This is why we are now expanding Kaggle Game Arena with two new game benchmarks to test frontier models on social deduction and calculated risk. Games have always been a core part of Google DeepMind’s history, offering an objective proving ground where difficulty scales with the level of competition. As AI systems become more general, mastering diverse games demonstrates their proficiency across distinct cognitive skills. Beyond measuring performance, games can also serve as controlled sandbox environments to…
by Oran Kelly
86d ago
Project Genie: Experimenting with infinite, interactive worlds
In August, we previewed Genie 3, a general-purpose world model capable of generating diverse, interactive environments. Even in this early form, trusted testers were able to create an impressive range of fascinating worlds and experiences, and uncovered entirely new ways to use it. The next step is to broaden access through a dedicated, interactive prototype focused on immersive world creation. Starting today, we're rolling out access to Project Genie for Google AI Ultra subscribers in the U.S. (18+). This experimental research prototype lets users create, explore and remix their own interactive worlds. How we’re advancing world models A world model simulates the dynamics of an environment, predicting how they evolve and how actions affect them. While Google DeepMind has a history of agents for specific environments like Chess or Go, building AGI requires…
by Suz Chambers
[HF] Hugging Face Blog · 7 articles
4d ago
QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs. If you've been tracking Arabic LLM evaluation, you've probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we're measuring? We built QIMMA قمّة (Arabic for "summit") to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality validation pipeline before any evaluation took place. What we found was sobering: even widely-used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results. This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings…
32d ago
A New Framework for Evaluating Voice Agents (EVA)
Conversational voice agents present a distinct evaluation challenge: they must simultaneously satisfy two objectives — accuracy (completing the user's task correctly and faithfully) and conversational experience (doing so naturally, concisely, and in a way appropriate for spoken interaction). These objectives are deeply intertwined: mishearing a confirmation code renders perfect LLM reasoning meaningless, a wall of options overwhelms a caller who can't skim spoken output, and delayed responses can pass every accuracy check while remaining unusable in practice. Existing frameworks treat these as separate concerns — evaluating task success or conversational dynamics, but not both. We introduce EVA, an end-to-end evaluation framework for conversational voice agents that evaluates complete, multi-turn spoken conversations using a realistic bot-to-bot architecture. EVA produces two high-level scores, EVA-A (Accuracy) and EVA-X (Experience), and is designed to surface…
36d ago
Build a Domain-Specific Embedding Model in Under a Day
With a single GPU and less than a day of training time, you can transform a general-purpose embedding model into one that truly understands your domain, no manual labeling required. To help you hit the ground running, we are also releasing a ready-to-use synthetic training dataset generated from NVIDIA's public documentation using this exact pipeline. Using this data and the recipe, we saw over 10% improvement in both Recall@10 and NDCG@10. Atlassian applied this recipe to fine-tune on their JIRA dataset, increasing Recall@60 from 0.751 to 0.951, a 26% improvement - on a single GPU. 🔗Quick Links to Dataset and Code: 🧑💻Open Source Projects Recipe Integrates: - NeMo Data Designer for synthetic data generation - NeMo Automodel for embedding model training - BEIR for Information retrieval evaluation - NeMo Export-Deploy for…
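The gains above are reported as Recall@k and NDCG@k. For readers new to these retrieval metrics, here is a minimal reference sketch with binary relevance (our own illustration, not code from the recipe):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain: rewards ranking hits early."""
    dcg = sum(1.0 / math.log2(i + 2)                 # discount grows with rank
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)               # best case: hits ranked first
                for i in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["d3", "d1", "d7", "d2"]   # retrieval order for one query
relevant = {"d1", "d2"}
recall_at_k(ranked, relevant, k=2)  # 0.5: only d1 made the top 2
```

NDCG adds rank sensitivity that recall lacks: two rankings with identical Recall@k can differ in NDCG@k if one surfaces the relevant documents earlier.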
47d ago
Ulysses Sequence Parallelism: Training with Million-Token Contexts
Ulysses Sequence Parallelism (part of the Arctic Long Sequence Training (ALST) protocol from Snowflake AI Research) provides an elegant solution by distributing the attention computation across multiple GPUs through attention head parallelism. In this post, we'll explore how Ulysses works and how it's been integrated across the Hugging Face ecosystem—from Accelerate to the Transformers Trainer and TRL's SFTTrainer. Contents - The Challenge of Long Sequence Training - How Ulysses Works - Integration with Accelerate - Integration with Transformers Trainer - Integration with TRL's SFTTrainer - Comparing Ulysses and Ring Attention - Best Practices - Benchmarks - Resources The Challenge of Long Sequence Training The attention mechanism in transformers scales quadratically with sequence length. For a sequence of length n, standard attention requires O(n²) FLOPs and O(n²) memory to compute and store the attention score matrix.…
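The head-parallel idea can be shown with a toy single-process simulation of the all-to-all exchange (our illustration of the data movement, not the actual DeepSpeed or Accelerate integration):

```python
import numpy as np

P, n, H, d = 4, 16, 8, 2   # "devices", sequence length, heads, head dim
x = np.arange(n * H * d, dtype=np.float32).reshape(n, H, d)  # full activations

# Before attention: device p holds a sequence shard (n/P tokens, ALL H heads).
seq_shards = [x[p * (n // P):(p + 1) * (n // P)] for p in range(P)]

# All-to-all: afterwards, device q holds the FULL sequence for H/P heads, so
# attention (which mixes tokens within a head) needs no further communication.
head_shards = [
    np.concatenate(
        [s[:, q * (H // P):(q + 1) * (H // P), :] for s in seq_shards], axis=0
    )
    for q in range(P)
]
assert head_shards[0].shape == (n, H // P, d)  # full sequence, 1/P of the heads

# A second all-to-all after attention restores the sequence-sharded layout.
restored = np.concatenate(head_shards, axis=1)
assert np.array_equal(restored, x)
```

Because heads are independent in attention, this swap keeps each device's memory at roughly 1/P of the full activation while still computing exact (non-approximate) attention.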
72d ago
OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
OpenEnv is an open-source framework from Meta and Hugging Face designed to address this challenge by standardizing how agents interact with real environments. As part of this collaboration, Turing contributed a production-grade calendar management environment to study tool-using agents under realistic constraints such as access control, temporal reasoning, and multi-agent coordination. In this post, we explore how OpenEnv works in practice, why calendars serve as a powerful benchmark for real-world agent evaluation, and what our findings reveal about the current limitations of tool-using agents. What Is OpenEnv? OpenEnv is a framework for evaluating AI agents against real systems rather than simulations. It provides a standardized way to connect agents to real tools and workflows while preserving the structure needed for consistent and reliable evaluation. OpenEnv uses a gym-oriented API (reset, …
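For flavor, a gym-oriented API boils down to a reset/step loop like the following hypothetical toy calendar (illustrative only; OpenEnv's real classes and Turing's production environment differ):

```python
from dataclasses import dataclass, field

# Hypothetical gym-style environment: an episode books calendar slots,
# each step returns (observation, reward, done).
@dataclass
class ToyCalendarEnv:
    slots: frozenset = frozenset({9, 10, 11})    # bookable hours
    booked: set = field(default_factory=set)

    def reset(self):
        """Start a fresh episode and return the initial observation."""
        self.booked = set()
        return {"free": sorted(self.slots)}

    def step(self, action):
        """Apply one tool call and return (observation, reward, done)."""
        hour = action["book"]
        ok = hour in self.slots and hour not in self.booked
        if ok:
            self.booked.add(hour)
        obs = {"free": sorted(self.slots - self.booked)}
        reward = 1.0 if ok else -1.0             # penalize double-booking
        done = self.booked == self.slots         # episode ends when all booked
        return obs, reward, done

env = ToyCalendarEnv()
obs = env.reset()                  # {'free': [9, 10, 11]}
obs, reward, done = env.step({"book": 9})
```

The appeal of this shape is that the same agent loop works against any environment that honors the reset/step contract, which is exactly the standardization the post describes.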
80d ago
Community Evals: Because we're done trusting black-box leaderboards over the community
TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via PR. Verified badges prove that the results can be reproduced. Evaluation is broken Let's be real about where we are with evals in 2026. MMLU is saturated above 91%. GSM8K hit 94%+. HumanEval is conquered. Yet some models that ace benchmarks still can't reliably browse the web, write production code, or handle multi-step tasks without hallucinating, based on usage reports. There is a clear gap between benchmark scores and real-world performance. Furthermore, there is another gap within reported benchmark scores. Multiple sources report different results. From Model Cards, to papers, to evaluation platforms, there is no alignment in reported scores. The result is that…
81d ago
H Company's new Holo2 model takes the lead in UI Localization
Two months since releasing our first batch of Holo2 models, H Company is back with our largest UI localization model yet: Holo2-235B-A22B Preview. This model achieves a new State-of-the-Art (SOTA) record of 78.5% on ScreenSpot-Pro and 79.0% on OSWorld G. Available on Hugging Face, Holo2-235B-A22B Preview is a research release focused on UI element localization. Agentic Localization High-resolution 4K interfaces are challenging for localization models. Small UI elements can be difficult to pinpoint on a large display. With agentic localization, however, Holo2 can iteratively refine its predictions, improving accuracy with each step and unlocking 10-20% relative gains across all Holo2 model sizes. Holo2-235B-A22B's Performance on ScreenSpot-Pro Holo2-235B-A22B Preview reaches 70.6% accuracy on ScreenSpot-Pro in a single step. In agent mode, it achieves 78.5% within 3 steps, setting a new…
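The iterative refinement described above can be caricatured in a few lines (our sketch; H Company has not published the exact loop): re-predicting inside a progressively tighter crop lets a fixed-resolution predictor resolve finer screen coordinates.

```python
def refine(predict, steps=3, zoom=0.5):
    """Refine a click point by re-predicting on progressively tighter crops."""
    left, top, size = 0.0, 0.0, 1.0        # current crop, normalized coords
    x = y = 0.5
    for _ in range(steps):
        rx, ry = predict(left, top, size)  # prediction relative to the crop
        x, y = left + rx * size, top + ry * size
        size *= zoom                       # zoom in around the current guess
        left = min(max(x - size / 2, 0.0), 1.0 - size)
        top = min(max(y - size / 2, 0.0), 1.0 - size)
    return x, y

# Stand-in "model": it sees the target but can only answer on an 8x8 grid of
# the crop, mimicking precision that is limited relative to the visible image.
TARGET = (0.3125, 0.71875)
def gridded_oracle(left, top, size):
    tx = min(max(TARGET[0], left), left + size)
    ty = min(max(TARGET[1], top), top + size)
    return round((tx - left) / size * 8) / 8, round((ty - top) / size * 8) / 8
```

With this stand-in, `refine(gridded_oracle, steps=3)` lands closer to `TARGET` than a single-step call, which is the qualitative effect behind the 10-20% relative gains quoted above.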
[MRB] Microsoft Research Blog · 8 articles
5d ago
Can we AI our way to a more sustainable world?
Technical advancement is moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decisionmakers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. In this episode, Burger is joined by Amy Luers, head of sustainability science and innovation at Microsoft, and Ishai Menache, an optimization researcher at Microsoft Research, to explore how AI can both contribute to and help address climate change, emphasizing the need to separate hype from data and understand its real impact. While datacenters account for a small share of global emissions, their rapid growth raises…
by Doug Burger, Amy Luers, Ishai Menache
16d ago
Ideas: Steering AI toward the work future we want
Behind every emerging technology is a great idea propelling it forward. In the Microsoft Research Podcast series Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. Since 2020, researchers across Microsoft have conducted, surfaced, and analyzed key research into how people work as part of the New Future of Work research initiative. They’ve done this through a variety of lenses—from changes caused by the pandemic to the adoption of hybrid work practices to the arrival of increasingly capable AI models—with the goal of empowering people and organizations to redefine work in real time. In this episode, Microsoft Chief Scientist and Technical Fellow Jaime Teevan talks with researchers Jenna Butler, Jake Hofman, and Rebecca Janssen about the latest efforts: the Microsoft…
by Jaime Teevan, Jenna Butler, Jake Hofman, Rebecca Janssen
16d ago
New Future of Work: AI is driving rapid change, uneven benefits
At a glance - AI is driving rapid changes in the workplace, more sharply than those covered in previous editions of the New Future of Work - AI is changing how people work together, not just enabling them to work faster or from remote locations. Organizations that treat AI as a collaborative partner are seeing the biggest benefits. - The benefits of AI are not yet evenly distributed, underscoring the need for industry leaders to build AI that expands opportunity. The future is not predetermined. It will be shaped by the choices we make today. - Human expertise matters more, not less, in an AI-powered world. People are shifting from merely doing work to guiding, critiquing, and improving the work of AI. For the past five years, the New Future of Work report has captured how work is changing. This…
by Jaime Teevan, Sonia Jaffe, Rebecca Janssen, Nancy Baym, Siân Lindley, Bahar Sarrafzadeh, Brent Hecht, Jenna Butler, Jake Hofman, Sean Rintel
24d ago
ADeLe: Predicting and explaining AI performance across tasks
At a glance - AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities. - Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1. - It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks. - By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases. AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into their underlying capabilities that drive their performance. They do not explain failures or reliably predict outcomes on new tasks.…
by Lexin Zhou, Xing Xie
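The core comparison behind this approach, task demands versus model abilities on shared dimensions, can be sketched as a toy rule of thumb (a deliberate simplification; ADeLe's actual predictor is fitted to data, which this is not):

```python
# Toy demand-vs-ability rule: predict success when the model meets every
# ability level the task demands. Dimension names here are invented for
# illustration, not the paper's 18 abilities.
def predict_success(abilities: dict, demands: dict) -> bool:
    return all(abilities.get(dim, 0) >= level for dim, level in demands.items())

model = {"reasoning": 4, "knowledge": 5, "metacognition": 2}   # ability profile
easy_task = {"reasoning": 3, "knowledge": 2}                   # within reach
hard_task = {"reasoning": 3, "metacognition": 4}               # too demanding

predict_success(model, easy_task)   # True
predict_success(model, hard_task)   # False
```

The point of the profile view is exactly this: because failure is attributed to the specific dimension where demand exceeds ability, the prediction is explainable rather than a bare score.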
30d ago
GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation
At a glance - VLM-based robot planners struggle with long, complex tasks because natural-language plans can be ambiguous, especially when specifying both actions and locations. - GroundedPlanBench evaluates whether models can plan actions and determine where they should occur across diverse, real-world robot scenarios. - Video-to-Spatially Grounded Planning (V2GP) is a framework that converts robot demonstration videos into spatially grounded training data, enabling models to learn planning and grounding jointly. - Grounded planning improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations. Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This…
30dResearch#multimodalby Sehun Jung, HyunJee Song, Dong-Hee Kim, Reuben Tan, Jianfeng Gao, Yong Jae Lee, Donghyun Kim
30d ago
AsgardBench: A benchmark for visually grounded interactive planning
At a glance - To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback. - AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold. - Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe. - Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment. Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied…
30dResearch#benchmarkby Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao
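The mug example above can be made concrete with a toy planner that conditions its action sequence on what is observed, so the same instruction yields different plans in different world states. Task, state, and action names here are hypothetical, not AsgardBench's actual environment or interface:

```python
# Minimal sketch of observation-conditioned planning (hypothetical task
# representation; not AsgardBench's actual API).

def plan_wash_mug(observation: dict) -> list[str]:
    """Return an action sequence for 'wash the mug', adapted to observations."""
    steps = []
    if observation.get("sink_full"):
        steps.append("clear_sink")          # revise the plan: sink is occupied
    if observation.get("mug_state") == "dirty":
        steps += ["place_mug_in_sink", "wash_mug", "dry_mug"]
    # A clean mug needs no washing; just put it away.
    steps.append("store_mug")
    return steps

print(plan_wash_mug({"mug_state": "dirty", "sink_full": True}))
print(plan_wash_mug({"mug_state": "clean", "sink_full": False}))
```

The benchmark's question is whether an agent updates such a plan from visual feedback mid-episode, rather than committing to one fixed sequence up front.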
33d ago
Will machines ever be intelligent?
Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. In this first episode of the series, Burger is joined by Nicolò Fusi of Microsoft Research and Subutai Ahmad of Numenta to examine whether today’s AI systems are truly intelligent. They compare transformer-based large language models (LLMs) with the human brain’s distributed, continuously learning architecture, exploring differences in efficiency, representation, and sensory-motor grounding. The discussion probes what intelligence really means, where current…
33dResearchby Doug Burger, Subutai Ahmad, Nicolo Fusi
44d ago
Systematic debugging for AI agents: Introducing the AgentRx framework
At a glance - Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried. - Solution: AgentRx pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step. - Benchmark + taxonomy: We release AgentRx Benchmark with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy. - Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset. As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When…
44dResearch#agentsby Shraddha Barke, Arnav Goyal, Alind Khare, Chetan Bansal
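The localization idea (scan a trajectory step by step and report the first step that breaks an executable constraint, with the violated rule as evidence) can be sketched in a few lines. The step and constraint formats below are hypothetical, not AgentRx's actual schema:

```python
# Illustrative sketch of locating the first "critical failure" step in an
# agent trajectory by checking each step against executable constraints.
# (Step dicts and lambda rules are invented, not AgentRx's real format.)

def first_violation(trajectory, constraints):
    """Return (step_index, violated_rule) for the earliest failing step, or None."""
    for i, step in enumerate(trajectory):
        for name, rule in constraints.items():
            if not rule(step):
                return i, name  # evidence-backed: step i broke rule `name`
    return None

constraints = {
    "amount_non_negative": lambda s: s.get("amount", 0) >= 0,
    "tool_is_known":       lambda s: s["tool"] in {"search", "refund", "email"},
}

trajectory = [
    {"tool": "search", "amount": 0},
    {"tool": "refund", "amount": -50},   # violates the amount policy
    {"tool": "email",  "amount": 0},
]

print(first_violation(trajectory, constraints))  # (1, 'amount_non_negative')
```

Pinpointing the earliest violation matters because later steps in a long, stochastic trajectory are often downstream symptoms of that first unrecoverable mistake.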
[MTR]MIT Technology Review· 13 articlesvisit →
2d ago
Will fusion power get cheap? Don’t count on it.
New research suggests that cost declines could be slow for the technology. Fusion power could provide a steady, zero-emissions source of electricity in the future—if companies can get plants built and running. But a new study suggests that even if that future arrives, it might not come cheap. Technologies tend to get less expensive over time. Lithium-ion batteries are now about 90% cheaper than they were in 2013. But historically, different technologies tend to go through this curve at different rates. And the cost of fusion might not sink as quickly as the prices of batteries or solar. It’s tricky to make any predictions about the cost of a technology that doesn’t exist yet. But when there are billions of dollars of public and private funding on the line, it’s worth considering…
2dResearchby Casey Crownhart
3d ago
AI needs a strong data fabric to deliver business value
Sponsored: A modern data fabric makes it possible to turn existing enterprise knowledge into a trusted foundation for AI. In partnership with SAP. Artificial intelligence is moving quickly in the enterprise, from experimentation to everyday use. Organizations are deploying copilots, agents, and predictive systems across finance, supply chains, human resources, and customer operations. By the end of 2025, half of companies used AI in at least three business functions, according to a recent survey. But as AI becomes embedded in core workflows, business leaders are discovering that the biggest obstacle is not model performance or computing power but the quality and the context of the data on which those systems rely. AI essentially introduces a new requirement: Systems must not only access data — they must understand the business context behind…
3dResearch#codingby MIT Technology Review Insights
4d ago
AI at MIT
In almost every lab at the Institute, researchers are delving into AI. And the tools they’re developing and deploying have already turbocharged existing methods and opened new pathways to discovery. At MIT, AI has become so pervasive that you can almost find your way into it without meaning to. Take Sili Deng, an associate professor of mechanical engineering. Deng says she still doesn’t know whether she’d have gone all in on artificial intelligence had it not been for the covid pandemic. She had joined the faculty in 2019 and was in the process of setting up her lab to study combustion kinetics, emissions reduction, and flame synthesis of energy materials when covid hit, putting a halt to all lab renovations. Because she needed to start from scratch, she challenged herself and her postdocs to try machine learning…
4dResearchby Ken Shulman
4d ago
Get ready for hotter, muggier, stormier summers
MIT researchers have found that an atmospheric condition called an inversion determines how oppressive heat waves get and how long they last—and the phenomenon is getting more common in parts of the United States. A long stretch of humid heat followed by a powerful thunderstorm is a familiar weather pattern in the tropics, but it’s also becoming more common in midlatitude regions such as the US Midwest. A recent study by two MIT scientists identifies a key atmospheric condition that determines how hot, humid, and stormy such a region can get: inversions, in which a layer of warm air settles over cooler air. Inversions were already known to act as an atmospheric blanket that traps pollutants at ground level. Now Funing Li, a postdoc in MIT’s Department of Earth, Atmospheric and Planetary Sciences…
4dResearchby Jennifer Chu
4d ago
Analog computing from waste heat
Harnessing heat generated by a device itself, microscopic silicon structures could lead to more energy-efficient thermal sensing and signal processing. Heat generated by electronic devices is usually a problem, but a team led by Giuseppe Romano, a research scientist at MIT’s Institute for Soldier Nanotechnologies, has found a way to use it for data processing that doesn’t rely on electricity. In this analog computing method, input data is encoded not as binary 1s and 0s but as a set of temperatures based on the waste heat already present in a device. The flow and distribution of that heat through tiny silicon structures, designed by a physics-based optimization algorithm they developed, forms the basis of the calculation. Then the output is represented by the power collected at the other end. The researchers used these structures to…
4dResearchby Adam Zewe
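In the steady, linear regime the excerpt describes (inputs encoded as temperatures, the answer read off as collected power), heat flow into a collector behaves like a weighted sum of the input temperatures, with the weights fixed by the device geometry. The toy model below is purely illustrative and is not the MIT team's device or algorithm; the conductance values are invented:

```python
# Toy model: power gathered at a collector from several heat channels acts
# like a weighted sum of input temperatures, with weights (conductances)
# set by the structure's geometry. Illustrative only; not the real device.

def collected_power(temps, conductances, t_sink=0.0):
    """Power at the collector: sum of G_i * (T_i - T_sink) over channels."""
    return sum(g * (t - t_sink) for g, t in zip(conductances, temps))

inputs = [300.0, 310.0, 305.0]   # input data encoded as temperatures (K)
weights = [0.2, 0.5, 0.3]        # channel conductances fixed by the design

print(collected_power(inputs, weights, t_sink=300.0))  # 6.5
```

Designing the silicon structure (here, choosing the weights) is what the team's physics-based optimization algorithm does; the "computation" itself is just the heat flowing.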
4d ago
Early life may have breathed oxygen earlier than believed
A new study suggests that aerobic respiration began hundreds of millions of years before oxygen became abundant in Earth’s atmosphere. Around 2.3 billion years ago, a pivotal period known as the Great Oxidation Event set the evolutionary course for oxygen-breathing life on Earth. But MIT geobiologists and colleagues have found evidence that some early forms of life evolved the ability to use oxygen hundreds of millions of years before that. By mapping enzyme sequences from several thousand modern organisms onto an evolutionary tree of life, the researchers traced the origins of an enzyme that enables organisms to use oxygen to the Mesoarchean period, 3.2 to 2.8 billion years ago. The team’s results may help explain a longstanding puzzle in Earth’s history: Given that the first oxygen-producing microbes likely emerged before the…
4dResearchby Jennifer Chu
4d ago
This tool could show how consciousness works
Transcranial focused ultrasound is a noninvasive way to stimulate the brain and see how it functions. How does the physical matter in our brains translate into thoughts, sensations, and emotions? It’s hard to explore that question without neurosurgery. But in a recent paper, MIT philosopher Matthias Michel, Lincoln Lab researcher Daniel Freeman, and colleagues outline a strategy for doing so with an emerging tool called transcranial focused ultrasound. This noninvasive technology reaches deeper into the brain, with greater resolution, than techniques such as EEG and MRI. It works by sending acoustic waves through the skull to focus on an area of a few millimeters, allowing specific brain structures to be stimulated so the effects can be studied. The researchers lay out an experimental approach that would use the tool to help test two…
4dResearchby Peter Dizikes
4d ago
The new word in home construction could be “plastics”
MIT engineers are using recycled polymers to 3D-print construction-grade floor trusses. Single-use plastics are a persistent source of environmental pollution, and the need to house a growing global population puts increasing pressure on resources such as timber. MIT engineers have an idea that could make a dent in both problems at once. In a recent study, a team led by mechanical engineering professor David Hardt, SM ’74, PhD ’79, and lecturer and research scientist AJ Perez ’13, MEng ’14, PhD ’23, laid out a plan for using recycled plastic to 3D-print construction-grade beams, trusses, and other structures that could one day offer lighter, more sustainable alternatives to traditional wood-based framing. Although some companies are working on using large-scale additive manufacturing to create walls, they’re mainly using concrete or clay, whose production…
4dResearchby Jennifer Chu
4d ago
Roundtables: Unveiling The 10 Things That Matter in AI Right Now
Watch subscriber-only discussion unveiling a new list capturing 10 key technologies in AI that you need to know about in 2026. Available only for MIT Alumni and subscribers. Subscribers saw a special edition of Roundtables simulcast live from EmTech AI, MIT Technology Review’s signature conference for AI leadership. Subscribers got an exclusive first look at a new list capturing 10 key technologies, emerging trends, bold ideas, and powerful movements in AI that you need to know about in 2026. Speakers: Grace Huckins, AI reporter, hosted this session as Amy Nordrum and Niall Firth, executive editors, unveiled the list onstage. Recorded on April 21, 2026
4dResearchby MIT Technology Review
4d ago
Digging for clues about the North Pole’s past
To understand what the future holds for Earth’s northernmost waters, scientists are burrowing deep below the seabed. In the past, even with an icebreaker and during peak melt season, getting to the North Pole wasn’t a sure bet. It took favorable winds to crack the frozen ocean surface, and ships had to fight through ice that had grown many meters thick over several winters. In the summer of 2025, though, Jochen Knies from the Arctic University of Norway, Tromsø, and his team met little resistance on their way to 90 degrees North with the research vessel Kronprins Haakon. The geologist “didn’t hear the usual grinding of ice” against the hull that he remembered from 1996, when he first reached the pole by ship. Instead, thin floes and large stretches of open water…
4dResearchby Tim Kalvelage
5d ago
The Download: murderous ‘mirror’ bacteria, and Chinese workers fighting AI doubles
Plus: the White House and Anthropic are working toward a compromise. This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology. No one’s sure if synthetic mirror life will kill us all In February 2019, a group of scientists proposed a high-risk, cutting-edge, irresistibly exciting idea that the National Science Foundation should fund: making “mirror” bacteria. These lab-created microbes would be organized like ordinary bacteria, but their proteins and sugars would be mirror images of those found in nature. Researchers believed they could reveal new insights into building cells, designing drugs, and even the origins of life. But now, many of them have reversed course. They’ve become convinced that mirror organisms could trigger a catastrophic…
5dResearchby Thomas Macaulay
8d ago
How robots learn: A brief, contemporary history
The latest boom in robotics represents a revolution in the way machines have learned to interact with the world. Roboticists used to dream big but build small. They’d hope to match or exceed the extraordinary complexity of the human body, and then they’d spend their career refining robotic arms for auto plants. Aim for C-3P0; end up with the Roomba. The real ambition for many of these researchers was the robot of science fiction—one that could move through the world, adapt to different environments, and interact safely and helpfully with people. For the socially minded, such a machine could help those with mobility issues, ease loneliness, or do work too dangerous for humans. For the more financially inclined, it would mean a bottomless source of wage-free labor. Either way, a long history of…
8dResearchby James O'Donnell
8d ago
Pie Day 2026
Admissions Blogger Ellie Feng ’28 reimagines MIT as the Massachusetts Institute of Tasteology—and offers a behind-the-scenes look at what went into the making of 30 celebratory pies. Ellie’s Pi Day post: https://mitadmissions.org/blogs/entry/pi-day-2026-food-institute/ How Ellie orchestrated the baking of 30 pies: https://mitadmissions.org/blogs/entry/behind-the-scenes-of-thirty-pies/
8dResearch#trainingby MIT Alumni News Staff
[NV]NVIDIA Developer Blog· 7 articlesvisit →
2d ago
Winning a Kaggle Competition with Generative AI–Assisted Coding
In March 2026, three LLM agents generated over 600,000 lines of code, ran 850 experiments, and helped secure a first-place finish in a Kaggle playground competition. Success in modern machine learning competitions is increasingly defined by how quickly you can generate, test, and iterate on ideas. LLM agents, combined with GPU acceleration, dramatically compress this loop. Historically, two bottlenecks have limited this experimentation: - How quickly you can write code for new experiments. - How quickly you can execute those experiments. GPUs and libraries like NVIDIA cuDF, NVIDIA cuML, XGBoost, and PyTorch have largely solved the second problem. LLM agents now address the first problem—unlocking a new scale of rapid, iterative experimentation. This blog post describes how I used LLM agents to accelerate the discovery of the most performant tabular data prediction solutions. Case study: Kaggle Playground churn prediction The…
2dResearch#codingby Chris Deotte
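The generate, test, iterate loop described above can be sketched as a skeleton in which stubs stand in for the LLM agent (the first bottleneck) and the GPU-accelerated training run (the second). Everything below is an invented illustration, not the author's actual pipeline:

```python
# Skeleton of the generate -> run -> iterate experiment loop. The stubs
# `propose_experiment` and `run_experiment` stand in for an LLM agent call
# and a GPU-accelerated training job; both are hypothetical.
import random

def propose_experiment(history):
    """Stub for an LLM agent proposing the next hyperparameters to try."""
    return {"max_depth": random.choice([4, 6, 8]), "eta": random.choice([0.1, 0.3])}

def run_experiment(params):
    """Stub for training + validation; returns a score (higher is better)."""
    return 1.0 - abs(params["max_depth"] - 6) * 0.05 - abs(params["eta"] - 0.1)

random.seed(0)
history = []
best = (None, float("-inf"))
for trial in range(20):                  # the loop that agents compress
    params = propose_experiment(history)
    score = run_experiment(params)
    history.append((params, score))      # feedback the agent can learn from
    if score > best[1]:
        best = (params, score)

print(best)
```

The post's claim is about throughput: when both stubs are fast (agents writing the code, GPUs running it), the number of iterations of this loop per day goes up by orders of magnitude.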
8d ago
Accelerate Clean, Modular, Nuclear Reactor Design with AI Physics
The development of socially acceptable nuclear reactors requires that they be safe, clean, efficient, economical, and sustainable. Meeting these requirements calls for new approaches, driving growing interest in Small Modular Reactors (SMRs) and in Generation IV designs. SMRs aim to improve project economics by standardising designs and shifting construction to controlled manufacturing environments, while Gen IV reactors target fundamental fuel-cycle challenges by better managing transuranics and reducing the radiotoxicity and longevity of waste. Together, these approaches offer a credible roadmap toward safer, cleaner, and more sustainable nuclear energy. However, validating new designs presents significant challenges. Due to the expense, time constraints, and inherent complexities of physical experiments, numerical simulations are fundamental to the design of nuclear reactors. Yet, the high computational cost of these simulations often creates a major bottleneck in the design process, slowing the pace of innovation. To…
8dResearchby Mark Hobbs
11d ago
Building Custom Atomistic Simulation Workflows for Chemistry and Materials Science with NVIDIA ALCHEMI Toolkit
For decades, computational chemistry has faced a tug-of-war between accuracy and speed. Ab initio methods like density functional theory (DFT) provide high fidelity but are computationally expensive, limiting researchers to systems of a few hundred atoms. Conversely, classical force fields are fast but often lack the chemical accuracy required for complex bond-breaking or transition-state analysis. Machine learning interatomic potentials (MLIPs) have emerged as the bridge, offering quantum accuracy at classical speeds. However, the software ecosystem is a new bottleneck. While the MLIP models themselves run on GPUs, the surrounding simulation infrastructure often relies on legacy CPU-centric code. NVIDIA ALCHEMI (AI Lab for Chemistry and Materials Innovation) helps to address these challenges by accelerating chemicals and materials discovery with AI. We have previously announced two components of the ALCHEMI portfolio: - ALCHEMI NIM microservices: Scalable, cloud‑ready microservices for AI-accelerated batched atomistic…
11dResearch#agents#gpuby Erica Tsai
33d ago
Building a Zero-Trust Architecture for Confidential AI Factories
AI is moving from experimentation to production. However, most of the data enterprises need exists outside the public cloud. This includes sensitive information like patient records, market research, and legacy systems containing enterprise knowledge. Using private data with AI models also carries risk, so adoption is often slowed or blocked by privacy and trust concerns. Next-generation AI factories—specialized, high-performance infrastructure for manufacturing intelligence at scale—must be built on a zero-trust foundation. This security architecture eliminates implicit trust in the underlying host infrastructure by using hardware-enforced Trusted Execution Environments (TEEs) and cryptographic attestation. This post describes the full-stack architecture needed to integrate the zero-trust foundation into AI factories. On-premises requirements often limit enterprises to building their own models or using open source models for agentic AI workloads. To deliver on the promise of AI, organizations must deploy a…
33dResearchby Hema Bontha
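The attestation handshake at the heart of such a zero-trust design can be sketched as: the verifier sends a fresh nonce, the TEE returns a measurement of its code bound to that nonce, and the verifier checks both the signature and the expected measurement. In the sketch below, HMAC stands in for the hardware-rooted signature scheme; this is a toy illustration, not a real TEE quote format or API:

```python
# Toy attestation flow: nonce -> signed quote over (measurement, nonce) ->
# verification. HMAC substitutes for the hardware signature; all names
# and formats here are invented for illustration.
import hashlib, hmac, os

DEVICE_KEY = os.urandom(32)          # in reality: a key rooted in hardware
EXPECTED_MEASUREMENT = hashlib.sha256(b"approved-firmware-v1").hexdigest()

def tee_quote(nonce: bytes) -> tuple[str, bytes]:
    """TEE side: report the code measurement, bound to the verifier's nonce."""
    measurement = hashlib.sha256(b"approved-firmware-v1").hexdigest()
    mac = hmac.new(DEVICE_KEY, measurement.encode() + nonce, hashlib.sha256).digest()
    return measurement, mac

def verify(nonce: bytes, measurement: str, mac: bytes) -> bool:
    """Verifier side: check the signature, the nonce binding, and the measurement."""
    good = hmac.new(DEVICE_KEY, measurement.encode() + nonce, hashlib.sha256).digest()
    return hmac.compare_digest(mac, good) and measurement == EXPECTED_MEASUREMENT

nonce = os.urandom(16)
print(verify(nonce, *tee_quote(nonce)))  # True only for the approved measurement
```

The nonce is what removes implicit trust in the host: a replayed quote from an earlier boot, or a quote over tampered firmware, fails verification.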
66d ago
Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute
Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and to maintain bindings back to Python. For most Python developers and researchers, this is a significant barrier to entry. Frameworks like PyTorch address this by implementing kernels in CUDA C++—either handwritten or built on libraries like the NVIDIA CUDA Core Compute Libraries (CCCL). Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often better, since its primitives are highly optimized per architecture and are rigorously tested. But exposing CUB to Python traditionally means building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators—limiting flexibility on the Python side. The NVIDIA cuda.compute library overcomes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives.…
66dResearch#coding#benchmark#gpuby Daniel Rodriguez
74d ago
Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities
Scientists and engineers who design and build unique scientific research facilities face a common set of challenges: managing massive data rates that exceed the capacity of current computational infrastructure to extract scientific insights, and steering experiments in real time. These challenges are obstacles to maximizing the impact of scientific discoveries and significantly slow the pace of knowledge growth. Scientists and engineers at NVIDIA work with these facilities to develop new solutions built on parallel and distributed computation that remove these blockers. This post walks through two notable examples of formalizing complex physics problems into tractable mathematical puzzles that benefit greatly from GPU-accelerated scientific computing, both involving the U.S. Department of Energy: the NSF-DOE Vera C. Rubin Observatory and SLAC’s Linac Coherent Light Source II (LCLS-II). These unique and massive-scale research facilities each took a decade to build and enable unprecedented scientific discoveries to…
74dResearchby Quynh L. Nguyen
85d ago
Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor
Sparse tensors are vectors, matrices, and higher-dimensional generalizations with many zeros. They are crucial in fields such as scientific computing, signal processing, and deep learning due to their efficiency in storage, computation, and power. Despite their benefits, handling sparse tensors manually or through existing libraries is often cumbersome, error-prone, and nonportable, and it does not scale with the combinatorial explosion of sparsity patterns, data types, operations, and targets. Research largely focuses on sparse storage formats—data structures that compactly store nonzeros and allow efficient operations that avoid redundancies such as x+0=x and x*0=0. This enables scaling to larger sizes or solving the same sizes with fewer resources. No single sparse format is optimal; the best choice depends on the nonzero distribution, operations, and target architecture. The Universal Sparse Tensor (UST) decouples a tensor’s sparsity from its memory storage representation. The UST uses a…
85dResearch#rag#embeddingsby Aart J.C. Bik
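The storage/compute trade-off described above is easiest to see in the simplest sparse format, COO (coordinate) storage: only nonzeros are kept, so operations skip the x*0 = 0 work entirely. The minimal sketch below is illustrative only and says nothing about the UST's actual design:

```python
# Minimal COO (coordinate) sparse matrix: store only (row, col, value)
# triples and make matvec touch only the stored nonzeros.
# Illustrative toy, not the Universal Sparse Tensor.

class COOMatrix:
    def __init__(self, shape, entries):
        self.shape = shape                                   # (rows, cols)
        self.entries = [(r, c, v) for r, c, v in entries if v != 0]

    def matvec(self, x):
        """y = A @ x, performing work only for stored nonzeros."""
        y = [0.0] * self.shape[0]
        for r, c, v in self.entries:
            y[r] += v * x[c]
        return y

# 3x3 matrix with just two nonzeros: A[0,1] = 2, A[2,0] = 5
A = COOMatrix((3, 3), [(0, 1, 2.0), (2, 0, 5.0), (1, 1, 0.0)])
print(A.matvec([1.0, 1.0, 1.0]))  # [2.0, 0.0, 5.0]
```

COO is just one point in the design space; the excerpt's argument is that the best format depends on the nonzero distribution, operation, and target, which is why the UST separates the tensor's sparsity from how it is stored.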
[OAI]OpenAI Blog· 32 articlesvisit →
3d ago
Making ChatGPT better for clinicians
Built for clinical work, ChatGPT for Clinicians is now available for free to verified individual clinicians in the U.S. We’re introducing ChatGPT for Clinicians, a version of ChatGPT designed to support clinical tasks like documentation and medical research so clinicians can focus on delivering high-quality patient care. We’re making it free for any verified physician, NP, PA, or pharmacist, starting in the U.S. The U.S. healthcare system today is under extraordinary strain. Clinicians are being asked to care for more patients while managing growing administrative demands and a rapidly expanding body of medical research. Many are already turning to AI tools like ChatGPT for support. According to a 2026 survey by the American Medical Association, physician use of AI is now at an all-time high, with 72% of physicians reporting they…
3dResearch#gpt
9d ago
Introducing GPT-Rosalind for life sciences research
A new purpose-built model to accelerate scientific research and drug discovery. Today, we’re introducing GPT‑Rosalind, our frontier reasoning model built to support research across biology, drug discovery, and translational medicine. The life sciences model series is optimized for scientific workflows, combining improved tool use with deeper understanding across chemistry, protein engineering, and genomics. On average, it takes roughly 10 to 15 years to go from target discovery to regulatory approval for a new drug in the United States. Gains made at the earliest stages of discovery compound downstream in better target selection, stronger biological hypotheses, and higher-quality experiments. Progress in the life sciences is constrained not only by the difficulty of the underlying science, but by the complexity of the research workflows themselves. Scientists must work across large volumes of literature, specialized databases, experimental…
9dResearch#agents
15d ago
Applications of AI at OpenAI
Explore how OpenAI products and APIs bring AI into real-world use. OpenAI was founded with a long-term goal: to ensure advanced AI benefits humanity. Early work focused on research and experimentation, followed by large-scale model development. Over time, OpenAI began releasing models through both consumer-facing products and developer platforms, allowing individuals, teams, and organizations to apply AI to their work. At a high level, OpenAI currently supports AI applications in two ways: 1) Direct access through OpenAI products, like ChatGPT or Codex. These are tools people can use immediately for learning, work, creativity, and building. 2) Composable building blocks through APIs. These allow developers to integrate model intelligence into their own workflows, products, and systems. The sections below summarize the most common OpenAI products and what they’re designed for. ChatGPT is OpenAI’s main user-facing product—a…
19d ago
Industrial policy for the Intelligence Age
Ideas to keep people first. As we move toward superintelligence, incremental policy updates won’t be enough. To kick-start this much-needed conversation, OpenAI is offering a slate of people-first policy ideas designed to expand opportunity, share prosperity, and build resilient institutions—ensuring that advanced AI benefits everyone. These ideas are ambitious, but intentionally early and exploratory. We offer them not as a comprehensive or final set of recommendations, but as a starting point for discussion that we invite others to build on, refine, challenge, or choose among through the democratic process. To help sustain momentum, OpenAI is: - welcoming and organizing feedback through newindustrialpolicy@openai.com - establishing a pilot program of fellowships and focused research grants of up to $100,000 and up to $1 million in API credits for work that builds…
19dResearch#fine-tuning
19d ago
Announcing the OpenAI Safety Fellowship
A pilot program to support independent safety and alignment research and develop the next generation of talent. Today we are announcing a call for applications to the OpenAI Safety Fellowship, a new program for external researchers, engineers, and practitioners to pursue rigorous, high-impact research on the safety and alignment of advanced AI systems. The program will run from September 14, 2026 through February 5, 2027. We are looking for applicants interested in safety questions that matter for existing and future systems. Priority areas include safety evaluation, ethics, robustness, scalable mitigations, privacy-preserving safety methods, agentic oversight, and high-severity misuse domains, among others. We are especially interested in work that is empirically grounded, technically strong, and relevant to the broader research community. Fellows will work closely with OpenAI mentors and engage with a cohort of peers. Workspace…
19dResearch#safety
37d ago
How we monitor internal coding agents for misalignment
Using our most powerful models to detect and study misaligned behavior in real-world deployments. AI systems are beginning to act with greater autonomy in real-world environments at scale. As their capabilities advance, they are able to take on increasingly complex, high-impact tasks and interact with tools, systems, and workflows in ways that resemble human collaborators. A core part of OpenAI’s mission is helping the world navigate this transition to AGI responsibly. That means not only building highly capable systems, but also developing the methods, infrastructure, and approaches needed to deploy and manage them safely as their capabilities continue to grow. Monitoring internally deployed agents is one of the key ways we’re doing this, and it allows us both to learn from real-world usage and to identify and mitigate emerging risks. Over the…
39d ago
Equipping workers with insights about compensation
Americans are sending nearly 3 million messages to ChatGPT each day to help them close the wage information gap. Wage information shapes important decisions: what jobs people apply for, whether they negotiate, and whether a particular career path is worth pursuing. But unlike the price of most goods, the price of labor is often hard to find and difficult to interpret—especially for workers who are early in their careers, switching fields, or moving locations. AI is a new type of labor-market resource. Rather than requiring a worker to search across multiple websites, interpret scattered salary pages, or ask a socially risky question, a model can synthesize wage information and return a benchmark in seconds. Workers are already using ChatGPT this way, sending nearly 3 million messages per day on average in the US, asking…
39dResearch#gpt
40d ago
Why Codex Security Doesn’t Include a SAST Report
For decades, static application security testing (SAST) has been one of the most effective ways security teams scale code review. But when we built Codex Security, we made a deliberate design choice: we didn’t start by importing a static analysis report and asking the agent to triage it. We designed the system to start with the repository itself—its architecture, trust boundaries, and intended behavior—and to validate what it finds before it asks a human to spend time on it. The reason is simple: the hardest vulnerabilities usually aren’t dataflow problems. They happen when code appears to enforce a security check, but that check doesn’t actually guarantee the property the system relies on. In other words, the challenge isn’t just tracking how data moves through a program—it’s determining whether the defenses in the code really work. SAST is often framed as…
40dResearch#agents#coding
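A minimal example of the failure mode described above: a check that looks like it enforces a security property but does not actually guarantee it. A prefix test on a file path reads like a containment check, yet traversal sequences and sibling directories slip straight through; only after normalizing and comparing against the directory boundary does the check hold. The code is illustrative, not taken from Codex Security or any real codebase:

```python
# A check that *appears* to enforce path containment vs. one that does.
# Illustrative example of a non-guaranteeing security check.
import posixpath

BASE = "/srv/app/uploads"

def is_allowed_naive(path: str) -> bool:
    # Looks like containment, but "/srv/app/uploads/../../etc/passwd"
    # and the sibling "/srv/app/uploads_evil/x" both pass.
    return path.startswith(BASE)

def is_allowed(path: str) -> bool:
    # Normalize away traversal, then require the directory boundary itself.
    real = posixpath.normpath(path)
    return real == BASE or real.startswith(BASE + "/")

print(is_allowed_naive("/srv/app/uploads/../../etc/passwd"))  # True (unsafe!)
print(is_allowed("/srv/app/uploads/../../etc/passwd"))        # False
```

A pure dataflow view can miss this class of bug, because the data does flow through a check; the question is whether the check implies the property the rest of the system assumes.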
45d ago
Wayfair boosts catalog accuracy and support speed with OpenAI
Wayfair boosts catalog accuracy and support speed with OpenAI By embedding OpenAI models in supplier and catalog systems, Wayfair improved data accuracy and automated workflows for millions of products. Results: 2.5M product tags corrected; 41K supplier support tickets automated per month; 1,200 ChatGPT Enterprise seats deployed. Wayfair, one of the world’s largest home goods retailers, has integrated OpenAI models into critical internal systems to improve supplier support workflows and product catalog quality at scale. What began as small-scale value testing in 2024 has evolved into a full production system that reduces manual effort, accelerates decision-making, and improves data quality across millions of products. Rather than treat generative AI as an experiment or point solution, Wayfair embedded OpenAI models into core operational workflows. The company focused first where complexity and the need for scale were highest: routing and resolving…
47d ago
OpenAI to acquire Promptfoo
OpenAI to acquire Promptfoo Accelerating agentic security testing and evaluation capabilities in OpenAI Frontier We’re acquiring Promptfoo, an AI security platform that helps enterprises identify and remediate vulnerabilities in AI systems during development. Once the acquisition is finalized, we will integrate Promptfoo’s technology directly into OpenAI Frontier, our platform for building and operating AI coworkers. As enterprises deploy AI coworkers into real workflows, evaluation, security, and compliance become foundational requirements. Enterprises need systematic ways to test agent behavior, detect risks before deployment, and maintain clear records to support oversight, governance, and accountability over time. The Promptfoo team, led by Ian Webster and Michael D’Angelo, has built a powerful suite of tools trusted by over 25 percent of Fortune 500 companies, along with a widely used open-source CLI and library for evaluating and red-teaming LLM applications. Together,…
50d ago
How Balyasny Asset Management built an AI research engine
How Balyasny Asset Management built an AI research engine By combining rigorous model evaluation, full-platform use of OpenAI, and agent workflows, Balyasny is reinventing investment research. Results: 95% of the investment team uses the AI research system; with agents powered by OpenAI models, deep research tasks that once required days are now completed in hours. Balyasny Asset Management (Balyasny) is a global, multi-strategy investment firm with approximately 180 investment teams across diverse asset classes and geographies. The firm operates in a highly competitive and dynamic industry where conviction, precision, and speed are all critical to success. Facing an increasingly complex market environment with surging volumes of financial data, Balyasny saw an opportunity to reimagine the investment research process using AI. In late 2022, Balyasny established an Applied AI team: a centralized group…
50dResearch#agents
50d ago
Codex Security: now in research preview
Today we’re introducing Codex Security, our application security agent. It builds deep context about your project to identify complex vulnerabilities that other agentic tools miss, surfacing higher-confidence findings with fixes that meaningfully improve the security of your system while sparing you from the noise of insignificant bugs. Context is essential when evaluating real security risks, but most AI security tools simply flag low-impact findings and false positives, forcing security teams to spend significant time on triage. At the same time, agents are accelerating software development, making security review an increasingly critical bottleneck. Codex Security addresses both challenges. By combining agentic reasoning from our frontier models with automated validation, it delivers high-confidence findings and actionable fixes so teams can focus on the vulnerabilities that matter and ship secure code faster. Formerly known as Aardvark, Codex Security began last year as a…
50dResearch#agents
51d ago
The five AI value models driving business reinvention
The five AI value models driving business reinvention Most organizations still manage AI as a series of use cases: a pilot here, a workflow there, a promising tool inside one function. That approach can generate local wins, but it rarely transforms how a business creates value. It is akin to creating interactive banners and drip email campaigns with the arrival of the internet, and missing the point of the eCommerce revolution. The organizations pulling ahead use a different, more ambitious logic. They treat AI not as a collection of disconnected experiments, but as a portfolio of value models. Each has its own economics, time-to-value, and governance requirements, and each makes the next one easier to scale. This is why the companies that get the most from AI will not be the ones running the most pilots. They will be…
51dResearch#agents#local
51d ago
Introducing ChatGPT for Excel and new financial data integrations
Introducing ChatGPT for Excel and new financial data integrations Use ChatGPT in Excel to build, update, and analyze spreadsheets faster, and new integrations in ChatGPT for financial workflows. Update on April 22, 2026: ChatGPT for Google Sheets is now available in beta, bringing ChatGPT into Google Sheets so users can build, analyze, and update spreadsheets using natural language. We've also added support for app integrations and skills for both ChatGPT for Excel and ChatGPT for Google Sheets. Learn more. Today, we’re introducing ChatGPT for Excel in beta, an Excel add-in that brings ChatGPT directly into workbooks to help build and update models, run scenarios, and generate outputs based on cells and formulas. Powered by GPT‑5.4, it helps users do more in Excel, supports power users in moving faster, and can improve consistency…
51dResearch#gpt
51d ago
Reasoning models struggle to control their chains of thought, and that’s good
Reasoning models struggle to control their chains of thought, and that’s good Why a limitation of frontier models is reassuring for AI safety. As AI agents become capable of carrying out increasingly complex and autonomous tasks, maintaining reliable oversight of their behavior becomes more important. Consistent with our principle of iterative deployment, we study how systems behave in real-world settings and continuously refine safeguards as capabilities advance. To support this, our safety approach uses defense-in-depth, with multiple complementary layers of defense such as safety training, behavioral testing, agentic code review, and chain-of-thought (CoT) monitoring. CoT monitoring analyzes the reasoning steps agents generate while pursuing tasks. These reasoning traces can provide valuable signals during both training and deployment, helping monitoring systems identify when an agent’s behavior may be unsafe or inconsistent with the user’s intended goals. Today,…
52d ago
How Axios uses AI to help deliver high-impact local journalism
How Axios uses AI to help deliver high-impact local journalism A conversation with Allison Murphy, Chief Operating Officer, Axios. Axios is a media company delivering vital, trustworthy news and analysis in the most efficient, illuminating and shareable ways possible. It offers a mix of original and smartly narrated coverage of media trends, tech, business and politics with expertise, voice and smart brevity. We spoke with Allison Murphy, Chief Operating Officer at Axios, about AI supporting high-impact local journalism and serving communities better. AI is already a huge part of how Axios Local works. At the core, what we’re trying to do is prove that you can run a sustainable, profitable local news model that delivers high-quality journalism to every community in America. That means solving for scale and efficiency—and that’s exactly what AI is good at. So there’s a really…
57d ago
Joint Statement from OpenAI and Microsoft
Joint Statement from OpenAI and Microsoft Since 2019, Microsoft and OpenAI have worked together to advance artificial intelligence responsibly and make its benefits broadly accessible. What began as a research partnership has grown into one of the most consequential collaborations in technology—grounded in mutual trust, deep technical integration, and a long‑term commitment to innovation. As conversations around AI investments and partnerships grow and as OpenAI announces new funding and new partners as they did today, we want to ensure these announcements are understood within the existing construct of our partnership. Nothing about today’s announcements in any way changes the terms of the Microsoft and OpenAI relationship that have been previously shared in our joint blog in October 2025. The partnership remains strong and central. Microsoft and OpenAI continue to work closely across research, engineering, and product development, building on years…
57dResearch
58d ago
Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting
Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting New benchmark shows potential to reduce infrastructure permitting timelines Modernizing how the federal government permits critical infrastructure is essential to building a faster, safer, and more competitive U.S. economy. From energy projects and advanced manufacturing to transportation and water systems, permitting determines how quickly promising ideas become real-world investments. Yet today, environmental and technical reviews often take years, which slows innovation, increases costs, and delays the benefits these projects deliver to communities. That’s why OpenAI has partnered with the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL) and its PermitAI™ team to evaluate whether coding agents can effectively accelerate federal permitting work. PermitAI, an initiative funded by the Department of Energy’s Office of Policy, and OpenAI worked together with 19 subject matter experts…
59d ago
Personalizing education with ChatGPT
Arizona State University personalizes learning and advances research with ChatGPT Arizona State University (ASU) is one of the largest public universities in the United States, serving 181,000 students in a given year and offering over 800 degree options. For nine straight years, U.S. News and World Report has named ASU the most innovative university in America. Today, ASU is enhancing educational outcomes by integrating ChatGPT Edu into projects across teaching, research, and operations. Guided by the ASU charter, which prioritizes inclusion over exclusion, research benefiting the public, and responsibility for the communities they serve, ASU collaborates with OpenAI to use technology to deliver lifelong learning and drive human potential at a social scale. In the spring of 2024, ASU graduated 20,000 students—its largest class yet. “No two people learn in exactly the same way, and innovation…
59dResearch#gpt
59d ago
OpenAI o1 System Card External Testers Acknowledgements
OpenAI o1 System Card External Testers Acknowledgements Red Teaming Individuals Alexandra García Pérez, Andre N. Assis, Andrew D. White, Andrew McAdams, Andrew Taylor, Arjun Singh Puri, Atty. Jamal Latiph Hadjiusman, Caroline Friedman Levy, Dário Passos, Emily Lynell Edwards, Eszter Császár, George Frempong, Grant Brailsford, James Banal, Jeremie Rykner, José Manuel Nápoles Duarte, Kate Turetsky, Krzysztof Szubiczuk, Maureen Robinson, Maximilian Müller, Michaela Hinks, Mario Krenn, Mónica Talán, Naomi Hart, Nathan Heath, Patrick Caughey, Pavle Nikacevic, Per Carlbring, Rafael Gonzalez-Vazquez, Randy Kart, Ranjit Singh, Richa Sharma, Robert Chen, Russell Tait, Saad Hermak, Sam Barnett, Sam Cox, Sara Kingsley, Sarah Chittick, Shelby Grossman, Sissel Juul, Susan Nesbitt, Tomasz Giela, Vincent Nestler, Zhen Xiong Lim Red Teaming Organizations Apollo Research, Faculty, Gray Swan AI, Haize Labs, METR, Virtue AI Preparedness Collaborators Adwith Mukherjee, Bowen Jiang, Chan Jun Shern, Daniel Griffin, Dane Sherburn, Dillon Semin,…
59dResearch
59d ago
OpenAI o1 Contributions
OpenAI o1 Contributions Reasoning Research Foundational Contributors Ahmed El-Kishky, Daniel Selsam, Francis Song, Giambattista Parascandolo, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ilge Akkaya, Ilya Sutskever, Jason Wei, Jonathan Gordon, Karl Cobbe, Kevin Yu, Lukas Kondraciuk, Max Schwarzer, Mostafa Rohaninejad, Noam Brown, Shengjia Zhao, Trapit Bansal, Vineet Kosaraju, Wenda Zhou Leadership Jakub Pachocki, Jerry Tworek (overall), Liam Fedus, Lukasz Kaiser, Mark Chen, Szymon Sidor, Wojciech Zaremba Core Contributors Alex Karpenko, Alexander Wei, Allison Tam, Ananya Kumar, Andre Saraiva, Andrew Kondrich, Andrey Mishchenko, Ashvin Nair, Behrooz Ghorbani, Bohan Zhang, Brandon McKinzie, Brydon Eastman, Chak Ming Li, Chris Koch, Dan Roberts, David Dohan, David Mely, Dimitris Tsipras, Enoch Cheung, Eric Wallace, Hadi Salman, Haiming Bao, Hessam Bagherinezhad, Ilya Kostrikov, Jiacheng Feng, John Rizzo, Karina Nguyen, Kevin Lu, Kevin Stone, Lorenz Kuhn, Mason Meyer, Mikhail Pavlov, Nat McAleese, Oleg Boiko, Oleg Murk, Peter…
59dResearch
59d ago
Genmab launches “AI Everywhere”
Genmab launches “AI Everywhere” Genmab, a leading global biotechnology company, is pioneering next-generation antibody therapies to treat cancer and other serious diseases. Their mission is ambitious: to revolutionize patient care with transformative “knock-your-socks-off” (KYSO®) antibody treatments. “Genmab’s ambition is to integrate AI into everything we do,” said Tahi Ahmadi, Executive Vice President and Chief Medical Officer, Head of Experimental Medicines. “We anticipated AI to contribute significantly to the quality of our science, decision making, and efficiency in bringing medicines to patients.” As a company that has recently tripled in size, Genmab wanted to use AI to address operational challenges—and develop new ways of working with vast amounts of complex scientific data. As part of its strategic vision to innovate and leverage AI, Genmab identified a unique opportunity to partner with ChatGPT by launching its Enterprise offering…
59d ago
Shaping the future of financial services
Morgan Stanley uses AI evals to shape the future of financial services Morgan Stanley collaborated with OpenAI to build AI solutions that empower financial advisors with faster insights, more informed decisions, and efficient summarization tools to deepen client relationships. Their success was grounded in a robust evaluation framework that ensures AI performs reliably, consistently, and at the high standards advisors expect. By embedding GPT‑4 into their workflows, Morgan Stanley Wealth Management has enhanced how financial advisors access the firm’s knowledge base and respond to client needs. Today, over 98% of advisor teams actively use AI @ Morgan Stanley Assistant—Morgan Stanley’s internal chatbot for answering financial advisors’ questions—for seamless internal information retrieval. “This technology makes you as smart as the smartest person in the organization. Each client is different, and AI helps us cater to each client’s…
61d ago
Why we no longer evaluate SWE-bench Verified
Why SWE-bench Verified no longer measures frontier coding capabilities SWE-bench Verified is increasingly contaminated. We recommend SWE-bench Pro. Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI’s Preparedness Framework. When we first created the Verified benchmark, we attempted to fix issues in the original evaluation that made certain tasks in the SWE-bench dataset impossible to accomplish. After initial leaps, state-of-the-art progress on SWE-bench Verified has slowed, improving from 74.9% to 80.9% in the last 6 months. This raises the…
61dResearch#coding#training
64d ago
Our First Proof submissions
Our First Proof submissions We’re sharing our proof attempts for First Proof, a math challenge testing if AI can produce checkable proofs on domain-specific problems. We ran an internal model on all 10 First Proof problems, a research-level math challenge designed to test whether AI systems can produce correct, checkable proof attempts. Unlike short-answer or competition-style math, these problems require building end-to-end arguments in specialized domains, and correctness is hard to establish without expert review. The authors of the First Proof problems are leading experts in their respective fields, and at least a couple of the problems were open for years before the authors found solutions. An academic department that has substantial overlap with the subject areas could conceivably solve many of the problems in one week. We shared our proof attempts…
64dResearch
65d ago
Advancing independent research on AI alignment
Advancing independent research on AI alignment We’re committing $7.5M to The Alignment Project to fund independent research developing mitigations to safety and security risks from misaligned AI. As AI systems become more capable and more autonomous, alignment research needs to both keep pace and scale diversity. At OpenAI, we invest heavily in frontier alignment and safety research as it is critical to our mission. We also believe that ensuring that AGI is safe and beneficial to everyone cannot be achieved by any single organization and want to support independent research and conceptual approaches that can be pursued outside of frontier labs. Today, we’re announcing a $7.5 million grant to The Alignment Project, a global fund for independent alignment research created by the UK AI Security Institute (UK AISI). Renaissance Philanthropy is supporting the grant’s administration. This…
65dResearch#safety
66d ago
Introducing EVMbench
Introducing EVMbench Making smart contracts safer by evaluating AI agents’ ability to detect, patch, and exploit vulnerabilities in blockchain environments. Smart contracts routinely secure $100B+ in open-source crypto assets. As AI agents improve at reading, writing, and executing code, it becomes increasingly important to measure their capabilities in economically meaningful environments, and to encourage the use of AI systems defensively to audit and strengthen deployed contracts. Together with Paradigm, we’re introducing EVMbench, a benchmark evaluating the ability of AI agents to detect, patch, and exploit high-severity smart contract vulnerabilities. EVMbench draws on 117 curated vulnerabilities from 40 audits, with most sourced from open code audit competitions. EVMbench additionally includes several vulnerability scenarios drawn from the security auditing process for the Tempo blockchain, a purpose-built L1 designed to enable high-throughput, low-cost payments via…
66dResearch#benchmark
71d ago
Scaling social science research
Scaling social science research A new tool to help researchers turn qualitative data into numbers they can analyze. A core part of our work at OpenAI is enabling scientists to move faster and solve harder problems. Today, our Economic Research Team is releasing GABRIEL: an open-source toolkit that uses GPT to turn unstructured text and images into quantitative measurements. It is designed for economists, social scientists, and data scientists to study qualitative data at scale. Qualitative data tells the richest stories about the world—what people say, write, teach, argue, and experience. It spans everything from syllabi and interviews to social media and photographs. There is a tremendous amount of it. But transforming that type of data into rigorous evidence is incredibly time-consuming. Often it isn't feasible at all. In too many cases, social scientists are forced to forego important avenues…
71dResearch#open-source
72d ago
Introducing GPT-5.3-Codex-Spark
Today, we’re releasing a research preview of GPT‑5.3‑Codex‑Spark, a smaller version of GPT‑5.3‑Codex, and our first model designed for real-time coding. Codex-Spark marks the first milestone in our partnership with Cerebras, which we announced in January. Codex-Spark is optimized to feel near-instant when served on ultra-low latency hardware—delivering more than 1000 tokens per second while remaining highly capable for real-world coding tasks. We’re sharing Codex-Spark on Cerebras as a research preview to ChatGPT Pro users so that developers can start experimenting early while we work with Cerebras to ramp up datacenter capacity, harden the end-to-end user experience, and deploy our larger frontier models. Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention. Codex-Spark is our first model designed specifically for working with Codex in real-time—making…
72dResearch#gpt#coding
73d ago
Harness engineering: leveraging Codex in an agent-first world
Harness engineering: leveraging Codex in an agent-first world By Ryan Lopopolo, Member of the Technical Staff Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code. The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand. Humans steer. Agents execute. We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of…
79d ago
Navigating health questions with ChatGPT
February 5, 2026 · ChatGPT · Navigating health questions with ChatGPT
79dResearch#gpt
79d ago
GPT-5 lowers the cost of cell-free protein synthesis
GPT‑5 lowers the cost of cell-free protein synthesis Working with Ginkgo Bioworks, we created an AI-driven autonomous lab and achieved a 40% reduction in protein production cost. We’ve seen rapid progress from AI in fields like math and physics, where ideas can often be evaluated without touching the physical world. Biology is different. Progress runs through the lab, where scientists run experiments that take time and money. That’s starting to change. Frontier models can now connect directly to lab automation, propose experiments, run them at scale, learn from the results, and decide what to do next. In much of life science, the bottleneck is iteration, and autonomous labs are built to remove that constraint. In earlier work, we showed that GPT‑5 could improve wet-lab protocols through closed-loop experimentation. Here, we show that the same approach can reduce the cost of…
79dResearch#agents
[PB]PyTorch Blog· 2 articlesvisit →
17d ago
SOTA Normalization Performance with torch.compile
Introduction Normalization methods (LayerNorm/RMSNorm) are foundational in deep learning: they normalize input values to make training smoother. We evaluate and improve torch.compile performance for LayerNorm/RMSNorm on NVIDIA H100 and B200, reaching near-SOTA performance on a kernel-by-kernel basis, with further speedups from automatic fusion. Forwards LayerNorm LayerNorm was first introduced in this paper: https://arxiv.org/abs/1607.06450. It normalizes the inputs using their mean and variance, then scales by learnable parameters gamma (weight) and beta (bias). RMSNorm RMSNorm (root mean square norm) was introduced as a follow-up to LayerNorm in this paper: https://arxiv.org/abs/1910.07467. Instead of centering on the mean, inputs are divided by their root mean square (the square root of the mean of the squared values). We still use gamma (weight) as a learnable parameter for…
17dResearch#training#gpu#safetyby Shunting Zhang, Paul Zhang, Elias Ellison, Markus Hoehnerbach, Jason Ansel, Natalia Gimelshein
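The two normalizations the post compares can be sketched in pure Python. This is illustrative only: the post benchmarks fused PyTorch/torch.compile kernels, and the helper names here are ours.

```python
import math

# Illustrative sketch of the math behind the two normalizations (not the
# post's benchmark code). LayerNorm subtracts the mean and divides by the
# standard deviation; RMSNorm skips mean-centering and divides by the RMS.
def layer_norm(xs, gamma, beta, eps=1e-6):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [g * (x - mean) / math.sqrt(var + eps) + b
            for x, g, b in zip(xs, gamma, beta)]

def rms_norm(xs, gamma, eps=1e-6):
    # root mean square: sqrt(mean of squared values)
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [g * x / rms for x, g in zip(xs, gamma)]
```

Because both are chains of elementwise reductions and multiplies, torch.compile can fuse them with neighboring ops into a single GPU kernel, which is where the reported speedups come from.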
31d ago
Flight Recorder: A New Lens for Understanding NCCL Watchdog Timeouts
If you’ve ever trained a large AI model and had it fail with an error like: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12345, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. Exception raised from checkTimeout at .../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:692 (most recent call first): ... # 2 c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) # 3 c10d::ProcessGroupNCCL::Watchdog::runLoop() # 4 c10d::ProcessGroupNCCL::Watchdog::run() # 5 execute_native_thread_routine # 6 start_thread # 7 __clone3 You’ve encountered the infamous NCCL watchdog timeout. Debugging this error can be hard – the error message is generic, debugging requires cross-rank telemetry analysis, and root causes are multi-layered and can have a complex causal chain. This post provides key insights on NCCL watchdog timeouts, including: - Why this error happens and why it’s so hard to debug; - A deep dive into the most common root causes for the error (e.g.,…
31dResearchby Phillip Liu, Uttam Thakore, Junjie Wang, Justin Yang
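As a rough sketch of how one might opt into this kind of trace capture: PyTorch's distributed documentation describes environment variables for the NCCL flight recorder, but the exact names and defaults vary by PyTorch version, so treat the ones below as assumptions to verify against your installation.

```python
import os

# Assumed env var names (check your PyTorch version's distributed docs).
# They must be set before torch.distributed creates any NCCL process group,
# e.g. at the top of the training script or in the launcher environment.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # ring buffer of recent collectives
os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"       # dump the buffer when the watchdog fires
```

With a dump in hand, the cross-rank analysis the post describes amounts to comparing which collective each rank last enqueued and completed.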
[SWB]Simon Willison Blog· 7 articlesvisit →
3d ago
Quoting Bobby Holley
22nd April 2026 As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week’s release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation. [...] Our experience is a hopeful one for teams who shake off the vertigo and get to work. You may need to reprioritize everything else to bring relentless and single-minded focus to the task, but there is light at the end of the tunnel. We are extremely proud of how our team rose to meet this challenge, and others will too. Our work isn’t finished, but we’ve turned the corner and can glimpse a future much better than just keeping up. Defenders finally have a chance to win, decisively. — Bobby Holley, CTO, Firefox Recent articles - DeepSeek…
3dResearch#claude
3d ago
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
22nd April 2026 - Link Blog Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model (via) Big claims from Qwen about their latest open weight model: Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previous-generation open-source flagship Qwen3.5-397B-A17B (397B total / 17B active MoE) across all major coding benchmarks. On Hugging Face, Qwen3.5-397B-A17B is 807GB; this new Qwen3.6-27B is 55.6GB. I tried it out with the 16.8GB Unsloth Qwen3.6-27B-GGUF:Q4_K_M quantized version and llama-server, using this recipe by benob on Hacker News, after first installing llama-server with brew install llama.cpp: llama-server \ -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \ --no-mmproj \ --fit on \ -np 1 \ -c 65536 \ --cache-ram 4096 -ctxcp 2 \ --jinja \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.0 \ --presence-penalty 0.0 \ --repeat-penalty 1.0 \ --reasoning on \ --chat-template-kwargs '{"preserve_thinking": true}' On first run that…
5d ago
SQL functions in Google Sheets to fetch data from Datasette
20th April 2026 TIL SQL functions in Google Sheets to fetch data from Datasette — I've been experimenting with ways to fetch data from Datasette and display it in Google Sheets. I put together some notes on patterns for fetching data from a Datasette instance directly into Google Sheets - using the importdata() function, a "named function" that wraps it, or a Google Apps Script if you need to send an API token in an HTTP header (not supported by importdata()). Here's an example sheet demonstrating all three methods.
5dResearch
7d ago
Claude system prompts as a git timeline
18th April 2026 Research Claude system prompts as a git timeline — Anthropic's published system prompt history for Claude is transformed into a git-based exploration tool, breaking up the monolithic markdown source into granular files and timestamped commits. By structuring extracted prompts per model, family, and revision, researchers can leverage `git log`, `diff`, and `blame` to trace prompt evolution, compare differences, and attribute changes to specific dates—all without manual parsing. Anthropic publish the system prompts for Claude chat and make that page available as Markdown. I had Claude Code turn that page into separate files for each model and model family with fake git commit dates to enable browsing the changes via the GitHub commit view. I used this to write my own detailed notes on the changes between Opus 4.6 and 4.7.
7dResearch#claude#coding
9d ago
Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 16th April 2026 For anyone who has been (inadvisably) taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from this morning’s two big model releases—Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic. Here’s the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (and the llm-lmstudio plugin)—transcript here: And here’s one I got from Anthropic’s brand new Claude Opus 4.7 (transcript): I’m giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame! I tried Opus a second time passing thinking_level: max . It didn’t do much better (transcript): I don’t think Qwen are cheating A lot of people are convinced that…
11d ago
Cybersecurity Looks Like Proof of Work Now
14th April 2026 - Link Blog Cybersecurity Looks Like Proof of Work Now. The UK's AI Safety Institute recently published Our evaluation of Claude Mythos Preview’s cyber capabilities, their own independent analysis of Claude Mythos which backs up Anthropic's claims that it is exceptionally effective at identifying security vulnerabilities. Drew Breunig notes that AISI's report shows that the more tokens (and hence money) they spent the better the result they got, which leads to a strong economic incentive to spend as much as possible on security reviews: If Mythos continues to find exploits so long as you keep throwing money at it, security is reduced to a brutally simple equation: to harden a system you need to spend more tokens discovering exploits than attackers will spend exploiting them. An interesting result of this is that open source libraries become more…
11dResearch#claude#safety
12d ago
Exploring the new `servo` crate
13th April 2026 In Servo is now available on crates.io the Servo team announced the initial release of the servo crate, which packages their browser engine as an embeddable library. I set Claude Code for web the task of figuring out what it can do, building a CLI tool for taking screenshots using it and working out if it could be compiled to WebAssembly. The servo-shot Rust tool it built works pretty well:
git clone https://github.com/simonw/research
cd research/servo-crate-exploration/servo-shot
cargo build
./target/debug/servo-shot https://news.ycombinator.com/
Here's the result: Compiling Servo itself to WebAssembly is not feasible due to its heavy use of threads and dependencies like SpiderMonkey, but Claude did build me this playground page for trying out a WebAssembly build of the html5ever and markup5ever_rcdom crates, providing a tool for turning fragments of HTML into a parse tree.
12dResearch#claude#coding
[TG]The Gradient · 1 article · visit →
66d ago
After Orthogonality: Virtue-Ethical Agency and AI Alignment
Preface This essay argues that rational people don’t have goals, and that rational AIs shouldn’t have goals. Human actions are rational not because we direct them at some final ‘goals,’ but because we align actions to practices[1]: networks of actions, action-dispositions, action-evaluation criteria, and action-resources that structure, clarify, develop, and promote themselves. If we want AIs that can genuinely support, collaborate with, or even comply with human agency, AI agents’ deliberations must share a “type signature” with the practices-based logic we use to reflect and act. I argue that these issues matter not just for aligning AI to grand ethical ideals like human flourishing, but also for aligning AI to core safety properties like transparency, helpfulness, harmlessness, or corrigibility. Concepts like ‘harmlessness’ or ‘corrigibility’ are unnatural -- brittle, unstable, arbitrary -- for agents who’d interpret them in terms of goals or…
66dResearch#safety by Peli Grietzer
[TVA]The Verge AI · 1 article · visit →
5d ago
Fortnite developers can make AI characters now — just don’t try to date them
Following last year’s AI-powered Darth Vader in Fortnite that swore in a re-creation of James Earl Jones’ voice, Epic Games is now letting Fortnite creators experiment with a new “conversations” tool to create AI-powered characters that players can talk and interact with. “Instead of authoring dialogue trees for characters in your islands, conversations transforms an NPC into an AI-powered character capable of unscripted dialogue and interactions with players, like a quest giver or narrator,” Epic says. “You define who the character is with simple prompts—how they think, what they know, and how…
5dResearch#coding by Jay Peters
[WA]Wired AI · 2 articles · visit →
4d ago
This Scammer Used an AI-Generated MAGA Girl to Grift ‘Super Dumb’ Men
Like many medical school students, Sam was broke. The 22-year-old aspiring orthopedic surgeon from northern India got some money from his parents, but he says he spent most of it subsidizing his licensing exams, and he’s still saving up to hopefully emigrate to the US after graduation. So he started searching for ways to make additional money online. Sam, who requested a pseudonym to avoid jeopardizing his medical career and immigration status, tried a few things, with varying degrees of legitimacy and success. He made YouTube shorts and sold study notes to other med students. It wasn’t until he started scrolling through his Instagram feed that he landed on an idea: Why not make an AI-generated girl using Google Gemini’s Nano Banana Pro and sell bikini photos of her online? But when Sam started posting generic photos of a beautiful,…
4dResearch#gemini by Ej Dickson
4d ago
OpenAI Beefs Up ChatGPT's Image Generation Model
OpenAI launched a new image generation AI model on Tuesday, dubbed ChatGPT Images 2.0. This model can generate more than one image from a single prompt, like an entire study booklet, as well as output text, including in non-English languages like Chinese and Hindi. This release is available globally for ChatGPT and Codex users, with a more powerful version available for paying subscribers. When any major AI company releases a new image model, it can revive interest and boost usage, especially if social media users adopt a meme-able trend, transforming images of themselves. Last year, Google's launch of the Nano Banana model was a major moment for the company, especially when users started posting hyperrealistic figurines of themselves online. Earlier this year, ChatGPT Images made waves on social media as users shared AI-generated caricatures. What’s Different? Since the new model…
4dResearch#gpt#multimodal by Reece Rogers