$ timeahead_
All sources: Ahead of AI (Sebastian Raschka) · Anthropic News · Apple Machine Learning Research · Ars Technica AI · AWS Machine Learning Blog · Cerebras Blog · Cohere Blog · CrewAI Blog · DeepSeek Blog · Distill.pub · fast.ai Blog · Fireworks AI Blog · Google AI Blog · Google Cloud AI Blog · Google DeepMind Blog · Groq Blog · Haystack (deepset) Blog · Hugging Face Blog · Import AI (Jack Clark) · LangChain Blog · LangFuse Blog · Lil'Log (Lilian Weng) · LlamaIndex Blog · Meta AI Blog · Microsoft AutoGen Blog · Microsoft Research Blog · Mistral AI News · MIT Technology Review · Modal Blog · n8n Blog · Nathan Lambert (RLHF) · NVIDIA Developer Blog · Ollama Blog · OpenAI Blog · Perplexity AI Blog · PyTorch Blog · Replicate Blog · Simon Willison Blog · TensorFlow Blog · The Batch (DeepLearning.AI) · The Gradient · The Verge AI · Together AI Blog · VentureBeat AI · vLLM Blog · Weights & Biases Blog · Wired AI · xAI (Grok) Blog
all · api · agents · frameworks · hardware · infra · model · open source · release · research · tutorial
★ TOP STORY · [MRB] · Infra · 3d ago

AutoAdapt: Automated domain adaptation for large language models

At a glance
- Problem: Adapting large language models to specialized, high-stakes domains is slow, expensive, and hard to reproduce.
- What we built: AutoAdapt automates planning, strategy selection (e.g., RAG vs. fine-tuning), and tuning under real deployment constraints.
- How it works: A structured configuration graph maps the full scope of the adaptation process, an agentic planner selects and sequences the right steps, and a budget-aware optimization loop (AutoRefine) refines the process within defined constraints.
- Why it matters: The result is faster, automated, more reliable domain adaptation that turns weeks of manual iteration into repeatable pipelines.

Deploying large language models (LLMs) in real-world, high-stakes settings is harder than it should be. In domains like law, medicine, and cloud incident response, performance and reliability can quickly break down because adapting models to domain-specific requirements is a slow and…

Microsoft Research Blog · read →
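The summary above describes AutoAdapt's pieces (a configuration graph, an agentic planner, and the AutoRefine loop) only at a high level. As a rough, hypothetical illustration of the budget-aware search idea, not Microsoft's implementation, here is a minimal Python sketch; every name in it (AdaptationConfig, evaluate, budget_aware_search) is invented for illustration.

```python
from dataclasses import dataclass
import random

# Hypothetical illustration only: AutoAdapt's real configuration graph,
# planner, and AutoRefine loop are not described in enough detail to reproduce.

@dataclass(frozen=True)
class AdaptationConfig:
    strategy: str          # e.g. "rag" or "fine_tune"
    chunk_size: int        # retrieval chunk size (meaningful for RAG)
    learning_rate: float   # fine-tuning learning rate (meaningful for fine-tune)

def evaluate(config: AdaptationConfig) -> float:
    """Stand-in for running the adapted model on a held-out domain evaluation set."""
    rng = random.Random(hash(config))
    return rng.uniform(0.5, 0.9)

def budget_aware_search(candidates, budget: int):
    """Spend at most `budget` evaluations and keep the best-scoring configuration."""
    best, best_score = None, float("-inf")
    for config in candidates[:budget]:      # each evaluation consumes one unit of budget
        score = evaluate(config)
        if score > best_score:
            best, best_score = config, score
    return best, best_score

if __name__ == "__main__":
    candidates = (
        [AdaptationConfig("rag", chunk_size=c, learning_rate=0.0) for c in (256, 512, 1024)]
        + [AdaptationConfig("fine_tune", chunk_size=0, learning_rate=lr) for lr in (1e-5, 2e-5, 5e-5)]
    )
    best, score = budget_aware_search(candidates, budget=4)
    print(f"best config under budget: {best} (score={score:.3f})")
```

The real system reportedly plans and sequences steps agentically rather than scoring a fixed candidate list, so treat this only as a picture of what optimizing under a budget constraint looks like.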
▲ trending · last 48h · view all →
[MRB] Microsoft Research Blog · 10 articles · visit →
5d ago
Can we AI our way to a more sustainable world?
Technical advancement is moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. In this episode, Burger is joined by Amy Luers, head of sustainability science and innovation at Microsoft, and Ishai Menache, an optimization researcher at Microsoft Research, to explore how AI can both contribute to and help address climate change, emphasizing the need to separate hype from data and understand AI's real impact. While datacenters account for a small share of global emissions, their rapid growth raises…
5d · Research · by Doug Burger, Amy Luers, Ishai Menache
16d ago
New Future of Work: AI is driving rapid change, uneven benefits
At a glance
- AI is driving rapid changes in the workplace, more sharply than those covered in previous editions of the New Future of Work report.
- AI is changing how people work together, not just enabling them to work faster or from remote locations. Organizations that treat AI as a collaborative partner are seeing the biggest benefits.
- The benefits of AI are not yet evenly distributed, underscoring the need for industry leaders to build AI that expands opportunity. The future is not predetermined; it will be shaped by the choices we make today.
- Human expertise matters more, not less, in an AI-powered world. People are shifting from merely doing work to guiding, critiquing, and improving the work of AI.

For the past five years, the New Future of Work report has captured how work is changing. This…
16d · Research · by Jaime Teevan, Sonia Jaffe, Rebecca Janssen, Nancy Baym, Siân Lindley, Bahar Sarrafzadeh, Brent Hecht, Jenna Butler, Jake Hofman, Sean Rintel
16d ago
Ideas: Steering AI toward the work future we want
Behind every emerging technology is a great idea propelling it forward. In the Microsoft Research Podcast series Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. Since 2020, researchers across Microsoft have conducted, surfaced, and analyzed key research into how people work as part of the New Future of Work research initiative. They’ve done this through a variety of lenses—from changes caused by the pandemic to the adoption of hybrid work practices to the arrival of increasingly capable AI models—with the goal of empowering people and organizations to redefine work in real time. In this episode, Microsoft Chief Scientist and Technical Fellow Jaime Teevan talks with researchers Jenna Butler, Jake Hofman, and Rebecca Janssen about the latest efforts: the Microsoft…
16d · Research · by Jaime Teevan, Jenna Butler, Jake Hofman, Rebecca Janssen
24d ago
ADeLe: Predicting and explaining AI performance across tasks
At a glance
- AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
- Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
- It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
- By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks.…
24d · Research · #benchmark · by Lexin Zhou, Xing Xie
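The ADeLe post above turns on scoring tasks and models on the same set of abilities so that demands can be compared directly with capabilities. The actual 18-ability rubric and predictor are not reproduced here; the toy Python sketch below, with invented ability names and numbers, only illustrates the demand-vs-ability comparison.

```python
import math

# Toy illustration of the demand-vs-ability idea; not ADeLe's actual rubric or predictor.
ABILITIES = ["reasoning", "domain_knowledge", "instruction_following"]  # ADeLe uses 18 abilities

# Hypothetical profiles on an assumed 0-5 scale.
model_profile = {"reasoning": 4.0, "domain_knowledge": 3.0, "instruction_following": 4.5}
task_demands = {"reasoning": 3.5, "domain_knowledge": 4.0, "instruction_following": 2.0}

def predict_success(profile: dict, demands: dict) -> float:
    """Rough success estimate: penalize every ability where the task demands more than the model has."""
    shortfall = sum(min(profile[a] - demands[a], 0.0) for a in ABILITIES)  # always <= 0
    return 1.0 / (1.0 + math.exp(-(2.0 * shortfall + 1.0)))               # squash into (0, 1)

print(f"predicted success: {predict_success(model_profile, task_demands):.2f}")
```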
30d ago
AsgardBench: A benchmark for visually grounded interactive planning
At a glance
- To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
- AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.
- Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
- Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied…
30d · Research · #benchmark · by Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao
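The article above says AsgardBench tests whether an agent revises its plan from what it observes, for instance skipping the wash step when the mug is already clean. The tiny Python sketch below (world state, plan, and helper names all invented) just makes that revision behavior concrete; it is not the benchmark's API.

```python
# Toy sketch of the plan-revision behavior described above; not AsgardBench's actual tasks or interface.

# Hypothetical world state the agent can only learn by observing.
world = {"mug": "clean", "sink": "full"}

initial_plan = ["pick_up mug", "wash mug", "place mug in cabinet"]

def observe(obj: str) -> str:
    """Stand-in for a visual observation of an object's state."""
    return world[obj]

def revise_plan(plan: list) -> list:
    """Drop or adapt steps that visual feedback shows are unnecessary or blocked."""
    revised = []
    for step in plan:
        if step == "wash mug" and observe("mug") == "clean":
            continue                      # mug is already clean: skip washing
        if "sink" in step and observe("sink") == "full":
            revised.append("clear sink")  # unblock before using the sink
        revised.append(step)
    return revised

print(revise_plan(initial_plan))   # ['pick_up mug', 'place mug in cabinet']
```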
30d ago
GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation
At a glance
- VLM-based robot planners struggle with long, complex tasks because natural-language plans can be ambiguous, especially when specifying both actions and locations.
- GroundedPlanBench evaluates whether models can plan actions and determine where they should occur across diverse, real-world robot scenarios.
- Video-to-Spatially Grounded Planning (V2GP) is a framework that converts robot demonstration videos into spatially grounded training data, enabling models to learn planning and grounding jointly.
- Grounded planning improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations.

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This…
30d · Research · #multimodal · by Sehun Jung, HyunJee Song, Dong-Hee Kim, Reuben Tan, Jianfeng Gao, Yong Jae Lee, Donghyun Kim
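The V2GP description above turns on the difference between a plan expressed only in natural language and one that also pins each action to a location. The Python sketch below, with invented field names and coordinates, is just a way to picture "spatially grounded" plan steps; it is not the benchmark's data format.

```python
from dataclasses import dataclass

# Toy illustration of spatially grounded plan steps; names and fields are invented,
# not the GroundedPlanBench/V2GP format.

@dataclass
class TextPlanStep:
    instruction: str                          # e.g. "place the cup on the shelf" (which shelf? where exactly?)

@dataclass
class GroundedPlanStep:
    action: str                               # what to do
    target_object: str                        # which object it applies to
    target_xyz: tuple                         # where in the robot's frame it should happen (meters)

plan = [
    GroundedPlanStep("pick", "cup", (0.42, -0.10, 0.05)),
    GroundedPlanStep("place", "cup", (0.30, 0.55, 0.80)),
]
for step in plan:
    print(f"{step.action} {step.target_object} at {step.target_xyz}")
```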
33d ago
Will machines ever be intelligent?
Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. In this first episode of the series, Burger is joined by Nicolò Fusi of Microsoft Research and Subutai Ahmad of Numenta to examine whether today’s AI systems are truly intelligent. They compare transformer-based large language models (LLMs) with the human brain’s distributed, continuously learning architecture, exploring differences in efficiency, representation, and sensory-motor grounding. The discussion probes what intelligence really means, where current…
33d · Research · by Doug Burger, Subutai Ahmad, Nicolo Fusi
44d ago
Systematic debugging for AI agents: Introducing the AgentRx framework
At a glance
- Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
- Solution: AgentRx pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step by step.
- Benchmark + taxonomy: We release the AgentRx Benchmark with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
- Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When…
44d · Research · #agents · by Shraddha Barke, Arnav Goyal, Alind Khare, Chetan Bansal
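The AgentRx summary above describes checking each trajectory step against executable constraints derived from tool schemas and domain policies, then reporting the first violation. The minimal Python sketch below, with an invented refund-agent example and hand-written constraints, illustrates that localization loop; the real constraint synthesis, schemas, and benchmark are not reproduced here.

```python
# Minimal sketch of "walk the trajectory, return the first step that violates a constraint".
# The constraints and the customer-support example are invented for illustration.

def amount_is_positive(step: dict, history: list) -> bool:
    """Tool-schema-style constraint: refunds must carry a positive amount."""
    return step["tool"] != "refund" or step["args"].get("amount", 0) > 0

def refund_requires_prior_lookup(step: dict, history: list) -> bool:
    """Domain-policy-style constraint: look the order up before refunding it."""
    if step["tool"] != "refund":
        return True
    return any(s["tool"] == "lookup_order" for s in history)

CONSTRAINTS = [
    ("amount_is_positive", amount_is_positive),
    ("refund_requires_prior_lookup", refund_requires_prior_lookup),
]

def first_critical_failure(trajectory: list):
    """Return (step index, violated constraint name) for the earliest violation, else None."""
    for i, step in enumerate(trajectory):
        for name, check in CONSTRAINTS:
            if not check(step, trajectory[:i]):
                return i, name
    return None

trajectory = [
    {"tool": "search_faq", "args": {"query": "refund policy"}},
    {"tool": "refund", "args": {"order_id": "A123", "amount": 25.0}},  # no prior lookup_order call
]
print(first_critical_failure(trajectory))  # (1, 'refund_requires_prior_lookup')
```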
46d ago
PlugMem: Transforming raw agent interactions into reusable knowledge
At a glance
- Today’s AI agents store long interaction histories but struggle to reuse them effectively.
- Raw memory retrieval can overwhelm agents with lengthy, low-value context.
- PlugMem transforms interaction history into structured, reusable knowledge.
- A single, general-purpose memory module improves performance across diverse agent benchmarks while using fewer memory tokens.

It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use. More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix useful experiences with irrelevant details, making retrieval slower and less reliable. The challenge is not storing more experiences, but organizing them so that agents can quickly identify what matters in…
46d · Agents · #agents · by Ke Yang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, ChengXiang Zhai
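The PlugMem post above contrasts dumping raw interaction history into context with retrieving a small amount of structured, reusable knowledge. The Python sketch below (log entries, schema, and retrieval rule all invented) is only a cartoon of that distinction, not PlugMem's memory module.

```python
# Toy sketch of "structure raw history into reusable knowledge"; not PlugMem's actual schema or retrieval.

raw_log = [
    "user asked to book a flight; agent called search_flights with wrong date format; call failed",
    "agent retried search_flights with YYYY-MM-DD dates; call succeeded",
    "user asked about the weather; agent answered from general knowledge",
]

def distill(log_entry: str) -> dict:
    """Stand-in for turning a raw interaction into a structured, reusable lesson."""
    return {
        "task": "flight_booking" if "flight" in log_entry else "other",
        "lesson": log_entry,
        "useful": "search_flights" in log_entry,
    }

memory = [distill(e) for e in raw_log]

def retrieve(memory: list, task: str, k: int = 2) -> list:
    """Return at most k lessons relevant to the current task, instead of the full log."""
    relevant = [m["lesson"] for m in memory if m["task"] == task and m["useful"]]
    return relevant[:k]

print(retrieve(memory, task="flight_booking"))
```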
52d ago
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
At a glance
- Phi-4-reasoning-vision-15B is a compact and smart open-weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It is a broadly capable model that supports natural interaction across a wide array of vision-language tasks and excels at math and science reasoning and at understanding user interfaces.
- We share lessons learned and best practices for training a multimodal reasoning model, showing the benefits of careful architecture choices, rigorous data curation, and a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight multimodal reasoning model, available through Microsoft Foundry, Hugging Face, and GitHub. Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning,…
52d · Infra · #phi #multimodal #training · by Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas
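The announcement above says Phi-4-reasoning-vision-15B is available through Microsoft Foundry, Hugging Face, and GitHub but does not include loading code. If the checkpoint follows the pattern of earlier Phi vision releases on Hugging Face, loading might look roughly like the sketch below; the repository id and the processor interface are assumptions, so check the official model card before relying on them.

```python
# Hypothetical loading sketch: the repository id and processor interface are assumed
# from earlier Phi vision releases and may differ; consult the official model card.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-reasoning-vision-15B"   # assumed name, not verified

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # earlier Phi vision checkpoints have shipped custom modeling code
    device_map="auto",        # requires the accelerate package
)
```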