$ timeahead_
★ TOP STORY · [FAB] · Infra · 19d ago


Fireworks Training is now in preview: an end-to-end platform for training and deploying frontier models at scale. Three surfaces for three kinds of teams, from a conversational agent that handles everything, to managed infrastructure for ML engineers, to bring-your-own training loop on Fireworks-hosted clusters. All on the same infrastructure that already handles production inference for Cursor, Vercel, Genspark, and others. All three surfaces are in preview now. Reinforcement learning is how teams push past the ceiling SFT hits on multi-step reasoning, reliable tool use, and mid-flight self-correction. Vercel used our RL infrastructure to build a custom "Auto Fix" model for v0. The model checks the output stream for errors and self-corrects without a second pass, reaching a 93% error-free generation rate, significantly outperforming closed frontier models, with a 40X improvement in end-to-end latency vs. the proprietary model it replaced and…

Fireworks AI Blog
▲ trending · last 48h
[FAB] Fireworks AI Blog · 99 articles
22d ago
4/3/2026 Scaling and Optimizing Frontier Model Training
How Fireworks scales frontier model training and offers the broadest set of fine-tunable MoE models on any platform. Training trillion-parameter Mixture-of-Experts (MoE) models has historically been bottlenecked by memory walls and complex cluster orchestration. Earlier this month, Cursor released Composer 2 — a frontier coding model that tops CursorBench at 61.3, SWE-bench Multilingual at 73.7, and Terminal-Bench at 61.7. Fireworks powers the Reinforcement Learning (RL) inference infrastructure behind it, proving that these bottlenecks can be overcome at scale. We have written about delta-compressed weight sync and multi-region rollout fleets, and about why numerical parity between training and inference is especially hard for MoE models. Those posts cover the inference half of the RL loop — rollouts, weight transfer, and numerical alignment. This post covers the last missing piece: the trainer itself. Our Training SDK provides the model…
28d ago
3/28/2026 The Fine-Tuning Bottleneck Isn't the Algorithm
TL;DR: Integration friction and slow iteration cycles are the bottlenecks that actually stall fine-tuning — not the algorithm. We share the patterns we see across engagements, how teams like Cursor and Genspark broke through them, and where the workflow is heading: toward fully agentic fine-tuning loops that close themselves. Most teams that come to us for fine-tuning are not struggling with the training algorithm. They are struggling with everything around it: getting reward functions to talk to internal APIs without leaking data, waiting days between experiments because each step lives in a different tool, and figuring out whether the problem even calls for SFT, RFT, or DPO. Over the past year, working with a select group of the most innovative startups, digital natives, and Fortune 500 companies, we have seen these patterns repeat across every engagement. Every team that comes…
33d ago
3/23/2026 Frontier RL Is Cheaper Than You Think
The conventional wisdom on RL infrastructure is wrong, and it is costing teams that could be competing at the frontier. The entire mega-cluster narrative rests on a single assumption: that you have to ship 1 TB of weights every time you update your rollout fleet. You do not. Researchers have spent the last year writing about asynchronous RL and rollout-training disaggregation in systems like AReaL. Teams like Kimi and MiniMax have also published engineering notes on RL parameter updates and asynchronous scheduling. We have been running that pattern in production. That mega-cluster instinct comes from pretraining, where the main systems problem is keeping one huge synchronous training job saturated. RL is a different problem. The question is not just how to run the trainer. It is also how to keep a large rollout fleet generating data from…
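The "ship 1 TB of weights" assumption is exactly what delta-style weight sync attacks. As a toy sketch only (not Fireworks' actual delta-compression scheme), syncing a sparse top-k delta instead of the full tensor looks like this:

```python
import numpy as np

# Toy delta-compressed weight sync (illustrative only, not Fireworks'
# actual scheme): ship the change since the last sync rather than the
# full weights, keeping only the largest-magnitude entries.
def compress_delta(old, new, keep_fraction=0.05):
    delta = (new - old).ravel()
    k = max(1, int(delta.size * keep_fraction))
    idx = np.argpartition(np.abs(delta), -k)[-k:]  # indices of top-k deltas
    return idx, delta[idx]

def apply_delta(old, idx, values):
    out = old.copy().ravel()
    out[idx] += values
    return out.reshape(old.shape)

rng = np.random.default_rng(0)
old = rng.standard_normal((1000, 100)).astype(np.float32)
new = old + 0.001 * rng.standard_normal((1000, 100)).astype(np.float32)

idx, vals = compress_delta(old, new)
approx = apply_delta(old, idx, vals)
print(vals.size, "of", new.size, "entries shipped")
```

The payload shrinks by the keep fraction, at the cost of small reconstruction error on the omitted entries.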
33d · Infra · #training
46d ago
3/10/2026 Training-Inference Parity in MoE Models: Where Numerics Drift
Kernel fusions that are mathematically equivalent can still drift numerically. Here are the parity bugs we hit across both Kimi K2.5 serving and Qwen3.5-MoE training bring-up. When you train a model and serve it for inference, you expect them to agree. The same weights, the same input, the same output distribution. This training–inference numerical parity matters more than it sounds. For dense models, parity is relatively easy. Mixture-of-Experts models like Kimi K2.5, Qwen3.5-MoE, and DeepSeek V3 are harder. With routed experts, shared expert pathways, and all-reduce communication twice per layer across deep stacks, there are many places where "mathematically equivalent" optimizations produce numerically different results. This post catalogs the pitfalls we found. Each is a class of optimization that inference engines use for performance, but that can silently break numerical alignment. We found most of these while…
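The root cause behind every pitfall in this class is that floating-point addition is not associative, so any fusion or parallel split that reorders a reduction can change the result. A standalone illustration (not from the post itself):

```python
import numpy as np

# Regrouping the same sum changes the result: floating-point addition
# is not associative, even though the math is.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)   # False: the two groupings round differently

# The same effect at tensor scale: a chunked float32 reduction and a
# single-pass reduction over identical data can disagree in the low bits.
x = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
chunked = np.float32(sum(c.sum(dtype=np.float32) for c in np.array_split(x, 8)))
single = x.sum(dtype=np.float32)
print(float(chunked), float(single))
```

Across deep stacks with communication twice per layer, this low-bit drift compounds, which is why "mathematically equivalent" is not the same as "numerically identical".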
48d ago
3/8/2026 Introducing Fireworks on Microsoft Foundry: Bringing Best-in-Class Open Model Inference to Azure
We are excited to announce the Public Preview of Fireworks AI on Microsoft Foundry, bringing our best-in-class fast open-model serving directly into Azure. This partnership integrates Fireworks’ high-performance inference and state-of-the-art (SOTA) open models into the unified Microsoft Foundry platform, which already offers a wide selection of models. By empowering developers with the fastest path to production-grade open models, this milestone ensures teams using this new solution have one place to use any model, any framework, with enterprise‑grade controls to build and run AI applications and agents at scale. Across industries, organizations are increasingly standardizing on open models to gain greater control over performance, cost, customization, and the security and compliance needed for enterprise deployment. With open models, teams can choose the right architecture per workload, bring their own weights, and fine-tune for quality, latency, and cost without provider lock‑in. Yet…
48d · Infra · #inference
48d ago
3/8/2026 Fireworks Acquires Hathora to Accelerate Global Compute Orchestration
Fireworks AI has acquired Hathora, and we're thrilled to bring their team and technology into the Fireworks family. Lin Qiao shared her excitement about the acquisition, noting, “Hathora’s intense focus on every millisecond and every routing decision is precisely the discipline required for cutting-edge AI inference.” Since the first multiplayer games appeared on the internet, lag has been the enemy. In gaming, milliseconds determine whether you win or lose. Speed isn’t a feature; it’s survival. AI inference is entering that same era. Solving that requires a particular kind of team: engineers who obsess over systems, performance, and reliability at a global scale. From the beginning, Fireworks has set out to build an elite group of infrastructure engineers. People who care deeply about kernel performance, scheduling decisions, networking paths, and the invisible layers that make intelligent systems instantaneous. The Hathora team…
48d · Infra · #inference
54d ago
2/3/2026 The Benchmark Gap: What It Takes to Ship Kimi K2.5
Kimi K2.5 is live on Fireworks at ~1/10 the cost and 2-3x the speed of closed frontier models. As the fastest open-source provider of Kimi K2.5, Fireworks is seeing unprecedented model adoption. Kimi K2.5 is a landmark release for open models with benchmark results on par with top closed models and unprecedented visual coding quality. But enabling full quality in production requires more than just hosting the model. Here's how Fireworks ensures that developers get the best quality on our platform and how that translates into specific edge cases. How We Approach Quality at Fireworks Deploying frontier open models has taught us that quality emerges or degrades in the gaps: between the model and serving stack, between the chat template on Hugging Face and what’s running in the first-party API.…
57d ago
2/27/2026 The DeepSeek Model Lineup: V3.2, R1, and Distilled Variants Mapped to Production Workloads
Key Takeaways: deepseek-chat and deepseek-reasoner now both point to V3.2, so any team routing to those endpoints without pinning a version is hitting a different model than they think. tool_calls arrays on distilled variants; we resolve these at the platform level on Fireworks On-Demand, which delivers ~250% better throughput and 50% lower latency than vLLM. As most AI developers are well aware, DeepSeek has become one of the defining companies in the open-weights AI ecosystem. Founded in 2023, the Chinese lab made global headlines in January 2025 when the release of R1 triggered one of the largest single-day market sell-offs in recent memory — wiping billions from Nvidia, Broadcom, and ASML as investors confronted an uncomfortable reality: that a Chinese lab operating under strict GPU export controls had managed to train a frontier-competitive model with orders of magnitude less compute than anyone…
57d · Tutorial · #inference
85d ago
1/30/2026 The Missing Piece of the OpenClaw Mania: Truly ‘Own Your AI’ with Fireworks AI
Building a "Personal Operating System" means nothing if you don't control the brain. Move your OpenClaw agent onto secure, cost-efficient, and fully private infrastructure. The recent explosion of interest around OpenClaw (formerly Moltbot or Clawdbot) has been incredible to watch. We are finally moving past simple chatbots and into a true agentic future—where an AI can handle your emails, manage your calendar, and act as a genuine extension of yourself. It's the dawn of the personal AI operating system. But there is a massive contradiction at the heart of the current OpenClaw phenomenon. Many are building a highly intimate "personal OS" that has access to your most private data—your messages, your files, your digital life—yet most users are piping that data straight into "black box" APIs from closed-source model providers. You get convenience, but you lose control. You don't know…
88d ago
1/27/2026 Build powerful agents on OSS models with Blazing Fast Inference on Fireworks
Kimi K2.5 just dropped yesterday and is available Day 0 on Fireworks! As open models get more powerful and agentic, low latency enables complex, multi-step AI agents to be usable in real time. Fireworks is fastest across the top open models among GPU-based providers, as benchmarked by Artificial Analysis. Get the top speed you need, tailored for your use case, only with Fireworks. Stay tuned as our engineering team continues to optimize the performance! Fireworks' customization engine and virtual cloud infrastructure are engineered to deliver best-in-class performance for developers. We've built the following advanced capabilities to enhance speed as seen in the benchmarks, and across multiple different use cases: FireOptimizer maximizes performance by optimizing three core dimensions: This ensures your hardware is precisely right-sized to meet the specific Service Level Agreements (SLAs) of your application. For applications where latency is critical, such…
89d ago
1/26/2026 Kimi K2.5 is Live on Fireworks: Vibe Coding, Agents, and Full-Parameter RFT
Kimi K2.5 is Moonshot AI’s flagship agentic model and a new SOTA open model. It unifies vision and text, thinking and non-thinking modes, and multi-agent execution into one model. We are launching Day-0 support for Kimi K2.5. Fireworks offers the fastest endpoint for all Kimi K2 series models as well as fine-tuning for Kimi K2 models. Additionally, we now offer a full-parameter RL tuning private preview for Kimi K2.5, enabling application builders to fine-tune the SOTA OSS VLM for use cases like vibe coding and agentic workflows. Sign up for the full-parameter RL tuning waitlist here. Kimi K2.5 demonstrates that open-source models are now surpassing their closed-source counterparts. The chart provides more details on the multiple benchmarks where Kimi K2.5 achieves SOTA results, including for Agents (HLE Full, BrowseComp, and Deepsearch) and for Vision…
92d ago
1/23/2026 Turning Production Logs into Evaluation Datasets: A Data-Driven Approach
If you are running an LLM in production, you have access to the most valuable resource for improving your model: your actual user traffic. Most teams know they need to run evaluations, but creating a high-quality evaluation dataset from scratch is difficult. Manually writing examples is time-consuming and often misses the nuances of how people speak. On the other hand, using your raw production logs directly isn't feasible; there is simply too much volume, noise, and redundancy to run model-based evaluations on everything. We believe the best evaluation datasets are inspired by production usage. They should reflect the reality of what your users are asking, without the overhead of processing every single log. Here is how we approach creating representative evaluation datasets from production traces. The challenge with raw production data is that it is unstructured. You might have 10,000…
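A naive version of that distillation step can be sketched without assuming anything about Fireworks' internal tooling: bucket prompts by a normalization key and cap the sample per bucket, so redundant traffic collapses while coverage survives.

```python
from collections import defaultdict

def sample_eval_set(prompts, per_bucket=2):
    """Collapse redundant production prompts into a small eval set.

    The bucketing key here is a naive normalization (lowercase, collapsed
    whitespace); a real pipeline might cluster on embeddings instead.
    """
    buckets = defaultdict(list)
    for prompt in prompts:
        key = " ".join(prompt.lower().split())
        buckets[key].append(prompt)
    eval_set = []
    for examples in buckets.values():
        eval_set.extend(examples[:per_bucket])
    return eval_set

logs = [
    "Reset my password",
    "reset my password",
    "Reset  my password",
    "Cancel my subscription",
]
print(sample_eval_set(logs, per_bucket=1))
# ['Reset my password', 'Cancel my subscription']
```

Three near-duplicate prompts collapse to one representative, while the rare prompt is kept, which is the shape a representative eval set needs.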
102d ago
1/13/2026 Best Open Source LLMs in 2026: We Reviewed 7 Models
With new open source LLMs launching nearly every week, figuring out which model actually fits your use case has become its own research project. Models like DeepSeek v3.2, Kimi K2.5, and Qwen3 VL now compete at the frontier. Each brings distinct strengths in reasoning, multimodal understanding, and efficiency. The stakes are higher than benchmark scores suggest. Your choice of foundation model shapes inference costs, response latency, and the quality your users experience in production. This roundup compares the top open source LLMs available today, breaking down their trade-offs and ideal applications so you can make an informed decision without running your own eval suite. TL;DR MoE architectures dominate this generation: trillion-parameter models that activate only 10B to 40B parameters per token. All seven models in this roundup are available on Fireworks for instant serverless inference. Architecture efficiency is the single…
115d ago
12/31/2025 DPO, your simplest RL pipeline with two rollouts
A recent research paper, "IT TAKES TWO: YOUR GRPO IS SECRETLY DPO", bridged DPO and GRPO by framing both under the same contrastive loss form, and experimentally verified that GRPO with group size 2 can sometimes perform reasonably well. In this blog post, we conversely claim that under a more on-policy setting, you can set up a reasonably well-functioning recurring/continuous model-training pipeline with one-off DPO training that can be as powerful as RL. DPO (Direct Preference Optimization) and GRPO (Group Relative Policy Optimization) are both powerful LLM fine-tuning techniques that allow models to be tuned towards generating better responses. In the DPO setup, one is expected to provide a dataset where each row contains a prompt and two responses. Among the two responses, one is preferred and the other is dispreferred. For example, I could…
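A minimal sketch of one preference row and the per-example DPO objective; the log-probability numbers below are hypothetical, not from a real model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed token log-probs under the policy
    being trained and a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# One dataset row: a prompt plus a preferred and a dispreferred response.
row = {
    "prompt": "Summarize the incident report in one sentence.",
    "chosen": "Service degraded for 12 minutes after a bad deploy and was rolled back.",
    "rejected": "There was an incident.",
}

# Hypothetical log-probs: the policy already prefers the chosen response
# slightly more than the reference does, so the loss is below log(2).
print(dpo_loss(logp_chosen=-20.0, logp_rejected=-25.0,
               ref_logp_chosen=-21.0, ref_logp_rejected=-24.0))
```

Widening the margin between chosen and rejected relative to the reference drives the loss toward zero, which is the sense in which a recurring, on-policy DPO pipeline can approximate RL.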
129d ago
12/17/2025 Self-Improving Agents, Powered by Your Evals
TL;DR: Eval Protocol is a unified eval interface that powers both prompt optimization and RL on the same evaluation function. Author your evaluator once, and instantly unlock huge gains on open-source models. Imagine: You’ve built an eval suite that tells you where your agent fails, but it doesn’t tell you how to fix it. So you tweak the prompt—add a few instructions, maybe some examples, re-run the evals, watch the number wobble up or down. After a while, it’s just hours of staring at failure cases and guessing what the next prompt change needs to be. Eval Protocol is introducing a new integration to help eliminate some of this prompt guesswork: GEPA inside EP. GEPA runs on your existing eval setup—datasets, metrics, and task constraints—to convert failure signals into precise, enforceable prompt improvements and deliver automatic, inspectable gains to your…
131d ago
12/15/2025 NVIDIA Nemotron 3 Nano on Fireworks: The Engine for Next-Generation AI Agents
We're excited to launch Day-0 support on Fireworks for the latest model in the NVIDIA Nemotron family, NVIDIA Nemotron 3 Nano, an advanced reasoning model set to fuel the next generation of AI agents. This model is a small, powerful, hybrid Mixture-of-Experts (MoE) model built for developers who need maximum compute efficiency and cutting-edge accuracy for specialized agentic systems. The model builds on the Nemotron 2 Nano release, combining a new Mixture-of-Experts architecture with the Nemotron hybrid transformer-mamba architecture. The MoE design reduces compute overhead to meet the tight latency demands of real-world applications. For leading accuracy, Nemotron 3 Nano is a 30B-parameter model with 3B active parameters for inference and a large 1M context length, trained on NVIDIA-curated, high-quality synthetic data from expert reasoning models. Along with the MoE architecture, it features a new token…
137d ago
9/12/2025 Understanding Embeddings and Reranking at Scale
Retrieval-Augmented Generation has emerged as the dominant paradigm for grounding large language models with external knowledge. Yet the quality of any RAG system fundamentally depends on its ability to retrieve the right information at the right time. This challenge has driven significant advances in two critical technologies: embeddings and reranking. Understanding their technical foundations and architectural implications is essential for building production-grade RAG systems that can handle real-world complexity. The journey from traditional keyword search to modern semantic retrieval represents a fundamental shift in how machines understand and retrieve information. To appreciate why embeddings and reranking matter, we must first understand the limitations they address and the complementary strengths of different retrieval paradigms. Traditional keyword-based search, epitomized by algorithms like BM25 (Best Matching 25), operates on lexical matching principles. BM25 scores documents based on term frequency (TF) and inverse document…
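To make the TF and IDF mechanics concrete, here is a from-scratch BM25 scorer; this is a simplified sketch of the standard formula, not production retrieval code:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query. `corpus` is the full
    list of tokenized documents, needed for IDF and average doc length."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)              # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # always positive
        freq = tf[term]
        norm = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / norm
    return score

corpus = [
    "fast inference for open models".split(),
    "reranking improves retrieval quality".split(),
    "the weather is nice today".split(),
]
query = "retrieval reranking".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores.index(max(scores)))  # 1: the reranking document wins
```

Lexical scoring like this is exactly what dense embeddings complement: BM25 cannot match "retrieval" to a document that only says "search", which is the gap semantic retrieval closes.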
138d ago
8/12/2025 Quality first: how Fireworks.ai is the go-to place for gpt-oss
It’s been an incredible week for the open-source AI community. The release of GPT-OSS marked a significant milestone, opening up new possibilities for developers and researchers worldwide. This is especially exciting as it is released from a US frontier lab 🇺🇸. At Fireworks.ai, we believe that making a model available is only the first step. The real work lies in making it reliable, performant, and truly production-ready. From the moment GPT-OSS was released, our team worked tirelessly not just to host it, but to provide the single best implementation available anywhere. Our "Quality First" approach meant diving deep into the code, identifying critical issues, and deploying robust fixes to ensure our partners and the entire community could build on a solid foundation. We were proud to help power the official demo site at gpt-oss.com and support the Hugging Face team…
140d ago
6/12/2025 Vision Model Platform Updates: Enhanced Capabilities and New Features
Enterprises process massive amounts of unstructured visual data daily—from scanned documents and medical records to product images and screenshots. Traditional text-only models leave this rich visual information untapped, missing opportunities to build rich digital experiences and unlock new business value. Many applications use the vision model platform on Fireworks to solve problems in highly innovative ways. We have seen many moonshot projects deemed impossible a year ago become a reality. Fireworks provides a convenient OpenAI-compatible API to access VLMs. You simply specify the input image and the text prompt in the same multi-turn chat context as other models. In this example, we use Qwen 2.5 VL to generate e-commerce product descriptions from images, and also task it with the downstream job of localization into several languages. When you combine vision models with the rest of Fireworks'…
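In the OpenAI-compatible chat format the post describes, the image and the text prompt travel together in one message as a content list. A sketch of the request body; the model id below is a placeholder, so check the Fireworks model catalog for real identifiers:

```python
# OpenAI-style multimodal chat request for a VLM. Built as a plain dict
# here so the structure is visible; in practice an OpenAI-compatible
# client library sends this for you.
payload = {
    "model": "accounts/fireworks/models/<vision-model-id>",  # placeholder
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Write a product description for this image, "
                            "then localize it to French and German.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/product.jpg"},
                },
            ],
        }
    ],
}

content = payload["messages"][0]["content"]
print([part["type"] for part in content])  # ['text', 'image_url']
```

Because this matches the standard chat completions shape, swapping a text-only model for a VLM is a payload change rather than an integration rewrite.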
141d ago
5/12/2025 Supervised Fine-Tuning (SFT) with LoRA on Fireworks AI: Tutorial
Supervised Fine-Tuning (SFT) is critical for adapting general-purpose Large Language Models (LLMs) to domain-specific tasks, significantly improving performance in real-world applications. Fireworks AI facilitates easy and scalable SFT through its intuitive APIs and support for Low-Rank Adaptation (LoRA), allowing efficient fine-tuning without full parameter updates. LoRA significantly reduces the computational cost of fine-tuning large models by updating only a small subset of parameters in a low-rank structure, making it particularly suitable for large models like LLaMA or DeepSeek. qLoRA (Quantized LoRA) further improves efficiency by enabling fine-tuning of 4-bit and 8-bit quantized models (dependent on model types) without sacrificing performance, reducing memory requirements even more. Fireworks AI supports both LoRA and qLoRA tuning, allowing up to 100 LoRA adaptations to run simultaneously on a dedicated deployment without extra cost. Step-by-Step Guide to Fine-Tuning with Fireworks AI Go to fireworks.ai >…
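The parameter savings behind LoRA are easy to verify from the shapes alone: a rank-r update B @ A to a frozen d-by-d weight trains r * 2d values instead of d * d. An illustrative sketch, unrelated to the Fireworks API:

```python
import numpy as np

d, r = 512, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)        # frozen base weight
A = (0.01 * rng.standard_normal((r, d))).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)   # B starts at zero, so at init the
                                         # adapted model equals the base model
x = rng.standard_normal(d).astype(np.float32)
y = W @ x + B @ (A @ x)                  # LoRA forward pass: train only A and B

trainable_ratio = (A.size + B.size) / W.size
print(trainable_ratio)  # 0.03125: ~3% of the parameters of full fine-tuning
```

This is also why many LoRA adapters can share one deployment: each adapter is a small fraction of the base model's size, while the frozen W is loaded once.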
143d ago
3/12/2025 Fine-Tuning DeepSeek v3 & R1 to optimize quality, latency, & cost
At Fireworks, we’re happy to announce customization of DeepSeek R1 & V3, through Quantization Aware Fine Tuning, is now available as part of our FireOptimizer adaptation engine. You can now tailor the behavior of these state-of-the-art open models specifically for your use case, and optimize for quality, latency & cost. You can deploy the tuned models with one-click on a dedicated deployment on Fireworks. To get started, reach out to your Fireworks representative, or contact us. DeepSeek R1 and V3 are state-of-the-art open models that excel at a variety of tasks including chat, code generation and reasoning over complex tasks, but fine tuning these models has proven very challenging: Accuracy drops from different training and serving configurations - DeepSeek is natively tuned and designed for FP8 serving. For nearly all existing LoRA methods, LoRA weights are assumed to be in…
152d ago
11/24/2025 Fireworks Expands AWS Alliance: Strategic Collaboration Agreement + GenAI Competency
Today, we’re excited to share a major milestone in our work to help every company own their AI: Fireworks has signed a Strategic Collaboration Agreement (SCA) with AWS, to advance joint go-to-market efforts, bringing highly performant, scalable open source AI solutions to customers. We have also achieved the AWS Generative AI Competency Partner status, a recognition of our expertise helping customers train, fine-tune, and deploy GenAI workloads on AWS. Builders on AWS have used Fireworks to bring custom models into production with higher performance and lower costs. Earlier this year, we launched native integrations for Amazon SageMaker AI and Amazon Bedrock AgentCore, allowing developers to train, fine-tune, and deploy their models inside the AWS workflows they already trust. This new collaboration builds upon that foundation, expanding it into a comprehensive set of resources to support our joint customers at scale.…
156d ago
11/20/2025 Eval Protocol: RL on your agents, in any environment
Eval Protocol (EP) is an open-source, language-agnostic framework that makes it easy to do reinforcement fine-tuning on agents, across any framework, environment, or trainer. Your agents, infrastructure, and training needs will evolve as you scale. Eval Protocol is designed to grow with you: migrate from local experiments to remote training, try different environments, support more agent types, or extend to multiple use cases with the same training setup. Eval Protocol lets you set up a lightweight interface between your agent environment and trainer; your agent stays unchanged, and the training setup plugs in seamlessly. Get set up for RL with your agent today. Fireworks is open-sourcing Eval Protocol because standardizing agent evaluation for RL benefits the entire ecosystem. The most valuable AI infrastructure has historically been open source—from our team’s roots at PyTorch to Transformers to RL frameworks. We believe…
157d ago
11/19/2025 Fireworks Achieves Triple ISO Certification, giving Enterprises Full Control and Trust in AI at Scale
AI adoption is accelerating, but enterprises face risk without verifiable security, privacy, and governance. Fireworks delivers on all three. We are proud to announce that Fireworks has achieved ISO 27001, ISO 27701 and ISO 42001 certifications, the leading global standards for information security, privacy management, and responsible AI governance. Few AI infrastructure providers hold all three, signaling a new benchmark in enterprise trust. Enterprises are under increasing scrutiny as regulators move quickly: through 2025, all 50 U.S. states have introduced AI legislation or proposals (NCSL), highlighting that responsible, secure AI is no longer optional – it’s required. These certifications signal rigorous processes, independent audits, and continuous improvement, giving enterprises confidence that their data, models, and workflows are protected, and empowering them to own and govern their AI with confidence. What We Protect How We Protect it Why It Matters To…
157d ago
11/19/2025 50 Trillion Tokens Per Day: The State of Agent Environments
TL;DR — Agents and LLMs are processing 1.5 quadrillion tokens per month, having reached massive scale over the past year. But the real story for the next 12 months isn't about which models are smartest—it's about the complex production environments where agents actually do work, optimizing not only the underlying models but the tools, workflows, and data in their environments. What emerges is a clear hierarchy where the ability to create high-quality environments is a determinant of market success—the companies building complete environments rather than just LLM wrappers are capturing the most value. For the last two years, the conversation around AI agents has been dominated by potential. Today, that conversation has fundamentally shifted from potential to production. Businesses have moved beyond prototyping, shipping agents that handle customer support, write enterprise-quality code, and manage complex workflows at scale. The…
157d · Agents · #agents
169d ago
7/11/2025 Understanding Function Calling: The Bridge to Agentic AI
Large language models (LLMs) have revolutionized natural language processing by generating impressive text based on massive pretraining and strategic alignment with user preferences during post training. However, their inherent limitation is that, while they excel at generating human-like language, they lack the ability to access or update real-world information on demand. This is where function (or tool) calling comes into play. What is Function Calling? Function calling refers to the process by which an LLM detects that a user request requires external data or action and then produces a structured output (typically in JSON) that specifies which function to call along with the necessary arguments. For example, instead of simply generating text to answer "What is the weather in London?" an LLM equipped with function calling can output a JSON object that triggers a weather API call. Once the external…
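That flow can be sketched end to end. The tool schema below follows the common OpenAI-style layout; the model output string and the get_weather stub are hypothetical stand-ins for a real completion and a real API:

```python
import json

# The function schema the model is shown alongside the conversation.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# A hypothetical structured output for "What is the weather in London?":
# instead of free text, the model names a function and its arguments.
model_output = '{"name": "get_weather", "arguments": {"city": "London"}}'

def get_weather(city: str) -> str:
    return f"(stub) weather for {city}"   # stands in for a real weather API

# Application-side dispatch: parse the call and route it to real code.
call = json.loads(model_output)
registry = {"get_weather": get_weather}
result = registry[call["name"]](**call["arguments"])
print(result)  # (stub) weather for London
```

The result is then handed back to the model as a tool message so it can compose the final answer, closing the loop the post describes.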
170d ago
6/11/2025 Building AI agents with the Fireworks Experimentation Platform (GA) and Build SDK (Beta)
When building AI agents, the best AI companies are jointly developing their product and models in a process of rapid, continuous iteration. Just as we saw the rise of CI/CD pipelines in software, we now see a similar pattern emerging for building AI systems. This development lifecycle has four essential steps: Each step, however, has its own challenges that slow you down. Complex infrastructure setup and failures, time spent waiting for GPUs, reconciling differences between training and serving (both in data and use cases), and maintaining service reliability in production are all pain points that affect the iteration velocity of AI teams. To help address these challenges, we’re excited to announce the GA of the Fireworks Experimentation Platform – designed to supercharge your experimentation velocity by reducing your iteration time from weeks to hours. The experimentation platform offers powerful capabilities…
176d ago
10/31/2025 Genspark’s Deep Research Agent Outperforms a Frontier Closed Model in Quality and Tool Calls using Fireworks RFT, Achieving a 50% Cost Reduction
Genspark, a leading innovator in the AI-powered application workspace, excels at delivering Agentic Full-Stack Web Applications, ranging from advanced Deep Research Search Agents to AI-generated Slides, Documents, and Sheets. By leveraging Fireworks’ Reinforcement Fine-Tuning to train large state-of-the-art open models, in one month Genspark achieved 12% better quality and 33% more tool calls than a state-of-the-art (SOTA) closed-source model, leading to a 50% cost reduction and superior answer quality. *Note: These were measured by Fireworks AI. 'Deep Research' AI agents are designed to perform complex, multi-step research tasks autonomously. The Genspark Deep Research tool is an AI-powered agent designed to automate and streamline the process of conducting thorough, multi-source investigations and generating comprehensive, structured reports on a given topic. Genspark’s Deep Research tool puts together a team of the best AI Models to collaborate for…
176d · Research · #agents · #inference
179d ago
10/28/2025 We raised $250M To Help Enterprises Own Their AI
Three years ago, before AI had taken over the world, my co-founders and I made a bet. We believed the future of AI wouldn’t be controlled by a handful of powerful foundation model labs, but distributed across thousands of enterprises that want to own and customize their own AI products. That founding thesis has paid off. Today, we’re announcing a $250 million Series C at a $4 billion valuation, co-led by Lightspeed Venture Partners, Index Ventures, and Evantic, with continued support from Sequoia Capital. This round, which includes primary and secondary funding, brings our total funding to over $327 million, with prior rounds led by Benchmark and Sequoia, and strategic participation from NVIDIA, AMD, MongoDB and Databricks. We raised this capital to meet surging enterprise demand for our production AI infrastructure and to cement our position as the market leader…
179d · Infra · #inference
180d ago
10/27/2025 Accelerate your Vision Pipelines with the new NVIDIA Nemotron Nano 2 VL Model on Fireworks AI
Exciting news for vision AI! Fireworks is proud to offer Day-0 support for the highly anticipated NVIDIA Nemotron Nano 2 VL, a 12B multimodal reasoning model for accelerating your document intelligence and video understanding applications. NVIDIA Nemotron Nano 2 VL, the latest innovation in the NVIDIA Nemotron family, is a vision language model (VLM) designed to push the boundaries of intelligent document processing, AI assistant video understanding, video captioning, multi-modal agentic workflows, and more. It enables AI assistants to extract, interpret, and act on information across text, images, tables, and video. VLMs are built by combining an LLM with a vision encoder, in effect giving the LLM eyes. VLMs often require a more complex architecture to integrate across multiple modalities. With Fireworks' Multimedia, developers can effortlessly unlock insights across various modalities from VLMs like NVIDIA Nemotron Nano 2 VL, bypassing the complexities of unstructured…
184d ago
10/23/2025 Deployment Shapes: One-Click Deployment Configured For You
Configuring the right LLM serving setup can be a headache for developers. There are a variety of optimizations that can be tweaked to balance speed and cost, ranging from quantization level and hardware choice to speculation technique and model-sharding specifics. Historically, Fireworks worked closely with our customers to manually implement all these optimizations, pushing the limits of what is possible and optimizing between latency, throughput, and cost for the customer’s use case. We make this infrastructure accessible to you in a few easy ways. The easiest way to start using Fireworks is with our serverless deployments, which let you make requests to the most popular models without any setup. Serverless deployments are preconfigured with one setup for everyone. That makes serverless very easy to use, but it also means that serverless might not be optimal for your goals, especially if you…
187d ago
10/20/2025 Fireworks and AMD partner to power the next gen of AI infrastructure on AMD Instinct™ GPUs
Fireworks and AMD have entered into a multi-year strategic agreement to optimize AMD Instinct™ GPUs and accelerate adoption across AI-native companies, developers, and enterprises. We’re excited to share this new chapter in Fireworks’ mission to power the next generation of AI inference workloads. Our collaboration brings together AMD’s leadership in high-performance computing and Fireworks’ advanced AI stack to deliver scalable, production-grade AI systems that run inference faster, with the best quality, for the most efficient cost. For every organization and workload, there is a sweet spot where price, performance, and speed meet a technical and business outcome. By partnering with AMD, Fireworks provides best-in-class optimization technology alongside AMD Instinct™ GPUs. From model-serving runtimes to training frameworks, Fireworks is working closely with AMD to optimize every layer of our software stack for AMD Instinct™ MI325X and MI355X accelerators. Tuning the Fireworks…
192d ago
10/15/2025 LLM on the edge: Model picking with Fireworks Eval Protocol + Ollama
Modern AI apps rarely run on a single model forever. Teams iterate, swap providers, and increasingly run open-source models locally for privacy, latency, and cost. This post shows how to use Fireworks Eval Protocol to do robust model picking and how to host models locally with Ollama so you can replace OpenAI usage at scale—without rewriting your app logic. We'll walk through two real examples in this repo: •End-to-end agent evaluation on the Chinook dataset (PydanticAI) •LLM-judge over Langfuse traces you already have in production The core idea: keep your evaluation harness the same; only swap the model backend using an OpenAI-compatible endpoint (Ollama). Why this approach works •Standard interface: Eval Protocol treats models as swappable via completion_params (model name, provider, base_url, etc.). •OpenAI-compatible: Ollama exposes an OpenAI-style API locally, so clients keep working with only config changes. •Evidence-based model…
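The swap described above comes down to configuration: the eval harness keeps calling the same OpenAI-compatible interface, and only the `base_url` and model name change between a hosted backend and a local Ollama server. A minimal sketch of that idea (the Ollama address is its standard local default; the Fireworks model name and the shape of `completion_params` are illustrative assumptions, not the Eval Protocol API):

```python
# Sketch of "swap the backend, keep the harness". The endpoint values are
# illustrative assumptions; only the Ollama /v1 address is its common default.

BACKENDS = {
    # Hosted model behind an OpenAI-compatible endpoint (model name assumed)
    "fireworks": {
        "base_url": "https://api.fireworks.ai/inference/v1",
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    },
    # The same harness pointed at a local Ollama server
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "model": "llama3.1",
    },
}

def completion_params(backend: str, **overrides) -> dict:
    """Build the params an eval harness would hand to its OpenAI-style client."""
    cfg = dict(BACKENDS[backend])
    cfg.update(overrides)
    return cfg

# Only configuration changes between backends; the eval code stays the same.
hosted = completion_params("fireworks")
local = completion_params("ollama", temperature=0.0)
```

The point is that the harness never learns which backend it is talking to, so ranking models locally versus hosted is a one-line config change.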
195d ago
12/10/2025 Best Practices for Multi-Turn RL
How to train LLM agents that can reliably plan, call tools, and recover from their own mistakes. Introduction In the evolution of AI agents, we are witnessing a distinct phase shift. We are moving past the era of simple, single-turn query-response interactions found in basic assistants and into a domain defined by sequential agency. We are no longer just asking models to answer a question; we are asking them to interface with an environment through multiple interactions to complete complex tasks that are multi-step, open-ended, and tool-heavy, for example: •Planning a trip: search, filter options, compare itineraries, book flights, reserve hotels, and monitor changes. •Data investigations: run SQL queries, inspect anomalies, call monitoring dashboards, and generate follow-up hypotheses. •Research workflows: search, read, summarize, cross-check, and synthesize across multiple sources. These are multi-turn, sequential decision problems: the agent must decide which tool…
196d ago
11/10/2025 Fireworks RFT: Build AI agents with fine-tuned open models that outperform frontier closed models
Fireworks RFT enables you to fine-tune frontier open models like DeepSeek V3 and Kimi K2 for your agentic product. Genspark beat frontier closed model quality by 10% in less than a month. Vercel achieved 94% error-free code generation and 40X faster speeds. Fireworks RFT is easy to use for application developers, enterprises and researchers alike - taking you from local evaluator to production in hours. Start training today — completely free through November 24, 2025 Fireworks RFT is a managed service for reinforcement learning. Train open models to excel at your product use case—multi-turn agents, coding, or complex reasoning. Fireworks RFT makes RL training accessible: no infrastructure to manage, and a developer-friendly workflow to securely connect production environments with training. Optimizing a model for your specific use case significantly enhances quality, accelerates performance, and reduces cost. Genspark: Training Deep Research…
200d ago
7/10/2025 Using Model-as-a-Judge for Reward in Reinforcement Fine Tuning
In domains that are inherently challenging to quantify, such as creative writing, we demonstrate that leveraging a superior large language model (LLM) as a judge can meaningfully improve the performance of the policy model. The Arena Hard Auto dataset encompasses tasks spanning creative writing, mathematics, and software engineering. The creative writing subset, for example, features prompts like the one below: Write a personal dialog about tension in a relationship, using these words: rocket, pollution, fitness, pierce, rational, fee, threaten, falsify, resource, treaty. Developing an effective rule-based reward function for dimensions such as style, diversity, and coherence is particularly challenging in creative domains. However, by utilizing a capable LLM as a judge, it becomes feasible to evaluate and compare responses with nuanced reasoning. In this blog, we discuss our training methodology and showcase some results. Download the Arena Hard dataset locally…
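The judge idea above can be sketched as a reward function that wraps a stronger model and parses its verdict into a scalar for the RL trainer. The prompt wording, the 1-10 scale, and the injected `judge` callable below are illustrative assumptions, not the methodology from the post:

```python
# Minimal sketch of a model-as-judge reward. The prompt template and scoring
# scale are assumptions for illustration; `judge` stands in for the real LLM.
import re

JUDGE_PROMPT = (
    "Rate the following response for style, coherence, and creativity "
    "on a scale of 1-10. Reply with 'Score: <n>'.\n\n"
    "Prompt: {prompt}\n\nResponse: {response}"
)

def judge_reward(prompt: str, response: str, judge) -> float:
    """Ask the judge model to score the policy output, then map its
    1-10 rating to a [0, 1] reward."""
    verdict = judge(JUDGE_PROMPT.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*(\d+)", verdict)
    if match is None:
        return 0.0  # an unparseable verdict earns no reward
    return min(int(match.group(1)), 10) / 10.0

# A stub judge stands in for the real LLM call here:
reward = judge_reward("Write a dialog...", "Sure! ...", lambda p: "Score: 7")
```

In practice the judge call is a real completion request, and the parsing needs to be robust to verdicts that ramble before stating a score.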
215d ago
9/22/2025 Traces Are All You Need (to rank LLMs)
From your existing observability platform logs to a data-driven model leaderboard in minutes – quickly compare candidate models with an LLM judge. Choosing the right AI model is a critical decision, yet it’s often a guess. Public benchmarks don't reflect the real-world trade-offs between cost, speed, and quality on your data. What if you could find the optimal model by building a leaderboard from your production logs in just five minutes? This post shows you how to find out using Eval Protocol, an open-source toolkit for building your internal model leaderboard. We’ll demonstrate a quick, no-ground-truth-required method and validate it by showing our results correlate strongly with the official Tau Bench Airline benchmark. While our Quickstart Guide covers the code, this article goes under the hood to explore the step-by-step methodology—inspired by Arena-Hard-Auto research—for turning raw logs into a validated…
226d ago
11/9/2025 Modernizing Healthcare with AI: How RADPAIR and Fireworks Unlock Smarter Radiology Workflows
Executive Summary RADPAIR is transforming radiology workflows with its intent to create an open-source SDK standard – anchored by the Report Document Schema (RDS) and Actions and Event Protocol (AEP) – which enables safe, intelligent AI interactions across reporting systems. RADPAIR’s SDK establishes a new AI orchestration standard for healthcare, designed for adoption by coalition partners to enable interoperable, safe, multi-agent workflows across institutions. Fireworks AI provides the enterprise-grade infrastructure and orchestration platform for RADPAIR’s fine-tuned models and multi-agent pipelines, ensuring real-time, scalable, and compliant performance. Today, radiologists at institutions including Radiology Partners, which handles 40-50 million cases annually, benefit from AI-assisted workflows that integrate real-time dictation and generative AI structured reporting, reducing cognitive load, accelerating throughput, and improving diagnostic confidence. Key performance gains observed in production include: By combining RADPAIR’s innovative AI orchestration with Fireworks’ scalable, low-latency infrastructure, the…
227d ago
10/9/2025 Announcing Embeddings and Reranking On Fireworks AI
Today, we're announcing a major upgrade to Fireworks for RAG workloads – we’re bringing the state-of-the-art Qwen3 8B Embeddings and Reranking models to serverless, and are introducing two new API endpoints to make it all easily accessible. Now, whether you're building semantic search, recommendation systems, or agents powered by enterprise data, Fireworks makes it easier than ever to build scalable RAG applications with open models. At a glance, a RAG pipeline consists of five core stages: The problem: Until now, teams have had to cobble together different providers for embeddings, reranking, and generation. The result is a pipeline that is complex, inconsistent, and hard to scale. Even though open models are crushing the leaderboards on embeddings and reranking tasks, AI teams are forced to choose between the operational pain of self‑hosting these models and the cost of closed‑source model APIs.…
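The retrieval stages of the pipeline above rest on a simple operation: embed the query, score candidate documents by vector similarity, and reorder them. A toy sketch of that ranking step, with hard-coded stand-in vectors where a real pipeline would call the Qwen3 embeddings endpoint:

```python
# Sketch of the retrieve-then-rerank stage of a RAG pipeline. The vectors are
# hard-coded stand-ins for real embedding-endpoint outputs, so the ranking
# logic itself is visible.
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]

docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
order = rank_documents([0.9, 0.1], docs)  # nearest document first
```

A dedicated reranking model replaces this raw cosine score with a learned relevance judgment over the query-document pair, which is why the pipeline benefits from both stages.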
230d ago
7/9/2025 Introducing FLUX.1 Kontext on Fireworks
We’re excited to bring FLUX.1 Kontext, a suite of generative flow matching models that can generate and edit images, to Fireworks. Unlike classic text-to-image models, the FLUX.1 Kontext family can perform in-context image generation, allowing you to prompt with both text and images, and seamlessly extract and modify visual concepts to produce new, coherent renderings. FLUX.1 Kontext models are developed by Black Forest Labs (BFL) and mark a significant expansion of classic text-to-image models by unifying instant text-based image editing and text-to-image generation. As a multimodal flow model, it combines state-of-the-art character consistency, context understanding and local editing capabilities with strong text-to-image synthesis. As part of the FLUX.1 Kontext suite, we bring two new in-context image models to the Fireworks API: Kontext can preserve unique elements of an image, such as a reference character or object in a picture, across…
231d ago
6/9/2025 Reinforcement Fine Tuning (Beta): Train expert open models to surpass closed frontier models
Today, we’re excited to announce the beta release of Reinforcement Fine-Tuning (RFT), a powerful new technique to create expert models for complex tasks across agentic reasoning, function calling, coding, and more. RFT can improve model quality with just a few examples. Compared with closed frontier models, our alpha users have been able to train open models to: Fireworks makes it easy to train expert models with RFT, by specifying an evaluator function that grades model outputs, with no infrastructure setup required! RFT on Fireworks supports frontier open models like Llama, Phi3/4, Qwen 2.5/3 and even DeepSeek V3 and R1. You can get started here. Training models using RFT on Fireworks is free of charge for the next 2 weeks! RFT works best for tasks with clear answers that can be graded or verified for correctness, by building on the concept…
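The evaluator-function idea above can be sketched as a grader the RL trainer maximizes: it receives a model output and returns a score. The function signature and sample shape below are illustrative assumptions, not the Fireworks RFT API:

```python
# Sketch of an RFT-style evaluator: grade a function-calling output with
# full, partial, or zero credit. The sample shape is an assumption here.
import json

def evaluate(sample: dict) -> float:
    """Return 1.0 for the correct tool with correct arguments,
    0.5 for the right tool with wrong arguments, 0.0 otherwise."""
    try:
        call = json.loads(sample["output"])
    except (json.JSONDecodeError, KeyError):
        return 0.0  # malformed output gets no reward
    if call.get("name") != sample["expected_tool"]:
        return 0.0
    return 1.0 if call.get("arguments") == sample["expected_args"] else 0.5

score = evaluate({
    "output": '{"name": "get_weather", "arguments": {"city": "SF"}}',
    "expected_tool": "get_weather",
    "expected_args": {"city": "SF"},
})
```

Graded shaping like this (partial credit for near-misses) tends to give the RL loop a smoother signal than a strict pass/fail check.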
233d ago
4/9/2025 Building Enterprise-Scale RAG Systems with Fireworks AI and MongoDB Atlas
In the fast-paced world of enterprise data, extracting actionable insights from vast amounts of unstructured information is a challenge many organizations face. Whether it’s earnings calls, financial reports, legal documents, or technical specifications, the ability to retrieve, synthesize, and act on this information quickly can make or break a company’s competitive edge. Enter Retrieval-Augmented Generation (RAG) – a cutting-edge solution combining Large Language Models (LLMs) with powerful retrieval systems to deliver contextually rich, actionable insights in real time. Why Enterprises Need RAG Let’s face it: traditional search systems just don’t cut it anymore. They’re limited to keyword matching and fail to grasp the semantic and contextual relationships necessary for enterprise-scale decision-making. Here’s why RAG stands out: Multi-Format Analysis: RAG handles data from PDFs, Word documents, spreadsheets, and even audio recordings, breaking silos between formats. Cross-Document Insights: It synthesizes information across…
242d ago
8/26/2025 DeepSeek V3.1 now on Fireworks AI!
TL;DR DeepSeek V3.1 is a major leap forward in open‑source LLMs. It introduces hybrid reasoning modes (“thinking” vs. “non‑thinking”), and reduces hallucinations by around 38% compared to V3. With enhanced tool integration and expanded multilingual capabilities across 100+ languages, V3.1 is optimized for real‑world, agent‑centric applications. To truly leverage its power, especially in agentic workflows and long‑document analysis, you’ll benefit from experienced engineers integrating it with APIs, tool chains, and memory systems. What Makes DeepSeek V3.1 Better than V3? At its core, DeepSeek V3.1 expands on DeepSeek V3’s architecture with several major enhancements: - •Hybrid Reasoning Modes: Toggle between “thinking” (chain‑of‑thought) and “non‑thinking” (rapid reply) using chat templates. - •Massive Context Capacity: Standard 128K‑token windows, trained with 10× more data for 32K context and 3.3× more tokens in the 128K phase than V3. - •Lower Hallucination Rates: About 38% fewer…
243d ago
8/25/2025 LLM Eval Driven Development with Claude Code
In our previous blog, we showed how to go from one test to many tests with Eval Protocol with Cursor. But what if you're starting from scratch? Today, with Claude Code supercharged by MCP servers pointing directly to our docs and a deep wiki, we'll show you how to go from 0 to 1. In other words, from a completely blank project to your first fully tested AI agent. To recap the core idea from the previous blog, we're adapting the classic software engineering practice of Test-Driven Development (TDD) to use evals in the era of LLMs. The idea is simple: you write evals that define the desired behavior before writing the actual code, and then build your agent to pass them. This post will demonstrate how applying a TDD workflow ensures that as you add new features or swap…
253d ago
8/15/2025 Your AI Benchmark is Lying to You. Here's How We Caught It
Would you give GPT-4.1 an A grade for this image? We sure wouldn’t! That’s exactly what our AI judge did, giving it a 93.3%. To its credit, it was a diligent box-checker, taking a list of 15 requirements and confirming that, yes, there were colored shapes where the logo should be, and a box where the search bar should be. It was technically correct, but its misalignment with human expectations is what matters. EvaluationResult: { "score": 0.9333333333333333, "is_score_valid": true, "reason": "1. The background is white. 2. Primary elements are horizontally centered. 3. The Google logo is in the center and uses the correct colors. 4. A prominent search bar is directly below the logo. 5. The search bar is a rounded rectangle with a light gray border. 6. The search bar…
254d ago
8/14/2025 Test-Driven Agent Development with Eval Protocol
Building AI agents is exciting, but let's be honest: they can be unpredictable. How do you add new features without secretly breaking old ones? How do you debug a complex, multi-turn conversation and prevent regressions? At Fireworks, we believe the answer lies in a familiar engineering practice: Test-Driven Development (TDD). We've developed the Eval Protocol, a pytest-centric framework designed to bring structure and reliability to agent development. In this blog, we'll walk you through an end-to-end journey of building a digital store concierge agent. Using Cursor as our AI coding partner and Eval Protocol for testing, we'll show you how to go from a rough idea to a well-tested, trustworthy agent. Every great project starts with an idea. I wanted to build a digital store concierge that could answer questions about the classic Chinook music database. Instead of writing a…
268d ago
7/31/2025 Run bulk async workloads with Fireworks Batch API
With Fireworks’ Batch API, you can asynchronously run large volumes of requests on 1000+ open or fine-tuned models with no rate limits, 50% lower cost, and a 24-hour turnaround time. This is helpful for use cases like: •Evaluations: Benchmark across models to identify the best model for your use case •Data generation: Generate bulk outputs using large models to fine-tune smaller models •Data augmentation: Create paraphrases, sentiment labels, or question-answer pairs at scale •ETL pipelines and daily bulk processing: Process large numbers of documents daily without worrying about rate limits To use the Batch API, you simply upload your dataset in JSONL batch format and kick off a Batch API job. You can then check in on the status of your request, and retrieve the results once they are ready. The Batch API has the following…
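The upload-then-poll workflow starts with a JSONL file, one request per line. A sketch of preparing such a file; the per-line shape below follows the common OpenAI-style batch convention (`custom_id` plus a request `body`) and is an assumption here, not the exact Fireworks schema:

```python
# Sketch of building a JSONL batch dataset: one JSON request object per line.
# The field names follow the common OpenAI-style convention and are an
# assumption, not the exact Fireworks batch schema.
import json

prompts = ["Summarize doc A", "Summarize doc B"]

def to_batch_lines(prompts, model="accounts/fireworks/models/llama-v3p1-8b-instruct"):
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",  # lets you match results back to inputs
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return lines

jsonl = "\n".join(to_batch_lines(prompts))  # write to a file, upload, then poll
```

Because results from a batch job can arrive in any order, the `custom_id` per line is what lets you join outputs back to the original documents.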
269d ago
7/30/2025 Fireworks Real-World Benchmarks: Find the Best OSS Model for the Job
The open-source model landscape is exploding, making it hard to choose the right model. To help you cut through the noise, Fireworks AI is sharing real-world benchmarks on recent model releases based on tasks we've seen in production. Our initial findings show Qwen Instruct excels at knowledge-heavy tasks, Qwen3 Coder is a strong contender for simple tool-use, and Claude Sonnet 4 remains the leader for complex, multi-step agentic workflows. Dozens of new open-source models have been released in the past few weeks alone. While this rapid innovation is exciting, it creates a significant challenge for developers and businesses: which model is actually the best for your specific use case? All models claim to top the latest benchmarks, but how do the models actually perform on real-world tasks, like classifying customer support tickets, powering an e-commerce search, or running a complex…
270d ago
7/29/2025 Introducing Vision-Language Model Fine-tuning: Tailor VLMs to Your Domain
Fireworks AI now offers supervised fine-tuning for Vision-Language Models (Qwen 2.5 VL family), letting you adapt state-of-the-art VLMs to your specific visual domain. Train models on your images and text data to achieve higher accuracy for specialized tasks like medical imaging, financial document analysis, or product cataloging. Built for production with optimized kernels, 64K context support, and deployment on the same platform powering Cursor fast-apply. Enterprises across healthcare, finance, and ecommerce accumulate massive amounts of domain-specific visual data—from medical imaging and financial documents to product catalogs. Vision-Language Models can understand and reason about both images and text simultaneously, unlocking applications like automated document processing, visual Q&A, and multimodal workflows. While general-purpose vision-language models are powerful, they often miss the nuanced patterns and terminology specific to an industry. Fine-tuning VLMs on your domain-specific data dramatically improves accuracy for specialized visual tasks…
274d ago
7/25/2025 How Notion Cuts Latency 4x and Scales Enterprise AI Workflows with Fireworks AI
Notion’s journey from individual users to enterprise powerhouse showcases how Fireworks AI enables scalable, reliable, and efficient AI experiences for over 100 million users, including nearly 70% of Fortune 100 companies. “Not everyone at Notion is an AI expert, but every engineer needs to be fluent in how to work in this AI landscape,” explains Sarah Sachs, Head of AI Engineering at Notion. With an expanding enterprise customer base, Notion needed AI that could do more than answer questions. They needed sophisticated AI agents to reliably integrate with complex workflows across tools like Slack, Jira, and GitHub. “Our users expect AI that helps them move naturally from meetings to tasks, not just a chat experience,” Sarah adds. “That transition from Q&A to agentic workflows is essential for ‘vibe working’ — our vision for how work should flow.” The stakes were…
277d ago
7/22/2025 A Deep Dive into MLA training/inference difference and why QK-Clip from Kimi is such an elegant idea
Today, we're unpacking a clever insight from the researchers behind Kimi K2, a powerful LLM from Moonshot AI. This all started from a fascinating exchange in the comment section of a technical blog post. We'll break it down step by step, with real math to appreciate the elegance, but I'll explain it like we're chatting over coffee. By the end, you'll see why this "QK-Clip" trick is so smart and how it makes models like Kimi more reliable for your apps. Anecdotally, I have heard whispers on the street that there are quality trade-offs with using MLA, and this may be the secret ingredient that some of the top labs have been missing, paving the way for inference to be more efficient across the board. Our story begins in the comment section of a blog post by Su Jianlin (苏剑林)…
277d ago
7/22/2025 VibeRL: When AI Trains AI
Reinforcement Learning (RL) isn't new. Think about it like training a pet - you give a command, your pet performs an action, and if it's correct, it gets a treat. Over time, your pet learns exactly what you want. The same idea has been quietly revolutionizing AI. Techniques like Proximal Policy Optimization (PPO) played a huge role in early successes such as ChatGPT. But honestly, these early methods weren’t easy. You had to juggle multiple models, tweak countless hyperparameters, and sometimes even then, things would just break. Things got simpler with methods like Group Relative Policy Optimization (GRPO), which reduced some complexity. But even then, RL remained tricky - designing reward functions and fine-tuning was more art than science. (We previously discussed how models can judge each other in our post, "Model as a Judge"). Recently, "Vibe Coding" - AI…
282d ago
7/17/2025 Sentient & Fireworks Power Decentralized AI At Viral Scale
Backed by $85 million from Founders Fund, Pantera, Framework Ventures, and Polygon Labs, Sentient unites Sandeep Nailwal (Polygon), Himanshu Tyagi (Witness Chain), and Princeton professor Pramod Viswanath at the helm. Their Princeton-driven research team is chasing a single, audacious goal: deliver the ultimate AI experience by fusing the planet’s collective intelligence into one open, decentralized network. Powered by blockchain and open-source models, Sentient turns transparency into a feature and democratizes AI for everyone. At the helm of product, Technical Product Manager Oleg Golev leads the charge in bringing that vision to life – starting with Dobby, an open-source family of LLMs showcasing AI loyalty at the model layer, fine-tuned to be loyal to personal freedom and the crypto community. The models possess unique qualities (distinct personality traits and human-like tone) that make them a perfect choice for content virality, while…
284d ago
7/15/2025 Fireworks AI Now Supports Amazon SageMaker
We’re thrilled to announce Fireworks AI has now made Amazon SageMaker available as a Bring Your Own Compute (BYOC) deployment option. This integration allows developers and enterprise ML teams to train models using SageMaker, and leverage Fireworks’ high-performance, low-latency inference platform for model serving — all within their existing AWS environment. As organizations embrace Generative AI at scale, they’re hitting the same roadblocks: training and experimentation in SageMaker are seamless, but production-grade inference requires a custom platform, performance tuning, and ongoing cost management. That’s where Fireworks comes in. Fireworks is the fastest inference and AI platform that enables customers to build magical AI applications. Fireworks offers: Now, with Amazon SageMaker as a deployment option, customers can get all of these benefits within their AWS environment. With Fireworks' deployment on Amazon SageMaker, you can: All of this while retaining full control over your data,…
284d ago
7/15/2025 Deep-dive into MuonClip: Fixing Attention Score Explosions in Transformer Training
Interactive visualization for MuonClip, brought to you from Fireworks.ai With the release of Kimi-K2, a state-of-the-art tool calling and instruction following model, the Kimi team also talked about how they scaled up their pre-training with a new optimizer, MuonClip. Honestly, we don’t see new optimizers that often, so let’s dive into this a little more to understand how it helped the Kimi team scale their training. Specifically, this was the part of the blog https://moonshotai.github.io/Kimi-K2/ related to MuonClip. So for people who are bad at math like me, what are they talking about, and how exactly does it solve their scaling problem? Before we hit the problem, let's recall how attention works in transformers (the backbone of most LLMs like GPT or Llama). Attention lets the model "focus" on relevant parts of the input sequence. It does this by…
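The attention mechanism being recalled can be sketched in a few lines: logits are the scaled dot products q·k / √d, which a softmax turns into weights. The toy values below also show the failure mode the post is about: when q·k grows unchecked, the softmax saturates and one key takes essentially all the mass. This is a generic refresher, not Kimi's actual implementation:

```python
# Refresher on attention scores: logits = q.k / sqrt(d), weights = softmax(logits).
# The second example shows how oversized logits saturate the softmax, the
# "attention score explosion" that QK-Clip / MuonClip is meant to prevent.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys):
    d = len(q)
    logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    return softmax(logits)

keys = [[1.0, 0.0], [0.0, 1.0]]
w_mild = attention_weights([1.0, 1.0], keys)     # balanced: both keys share mass
w_blown = attention_weights([100.0, 1.0], keys)  # saturated: one key takes ~all mass
```

Clipping the query/key magnitudes bounds those logits, which keeps the softmax from collapsing into a near-one-hot distribution during training.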
297d ago
2/7/2025 DeepSeek v3 and R1 Model Architecture: Why it's powerful and economical
DeepSeek v3 and R1 continue to use the traditional Transformer block, incorporating SwiGLU, RoPE, and RMSNorm. It also inherits Multi-head Latent Attention (MLA) and radical Mixture-of-Experts (MoE) introduced by DeepSeek v2. But what makes DeepSeek v3 so remarkable? Despite compute limitations, it leverages the scaling law by adopting a more aggressive MoE and utilizing FP8 precision for training. As we all know, the linear layers of the Feed-Forward Network (FFN) are low-rank in nature (that’s why LoRA performs exceptionally well): most parameters in the FFN are not equally important. That leaves optimization opportunities: how to only activate the useful parameters for each incoming prompt? The result is a sparsely-activated model, more famously known as Mixture of Experts (MoE). (MoE does not seem like the most appropriate name, since the MoE under LLM context emphasizes more on sparsity than expertise. There aren’t any…
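The sparse-activation idea can be sketched as a router: a gate scores every expert per token, only the top-k experts run, and their outputs are mixed by the renormalized gate weights. The sizes below are toy, and the routing shown is the generic top-k scheme, not DeepSeek's exact gating:

```python
# Sketch of top-k MoE routing: score all experts, run only the k best.
# Toy sizes; production MoEs route among far more experts per layer.
import math

def top_k_routing(gate_logits, k=2):
    """Return (expert_index, normalized_weight) pairs for the k highest gates."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 8 experts, but only the 2 most relevant run for this token:
routes = top_k_routing([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Because only k of the experts execute per token, total parameter count can grow far faster than per-token compute, which is the economy the post highlights.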
307d ago
6/22/2025 Unlock Your Tools: Fireworks Adds OpenAI-Response API with MCP Support (Beta)
TL;DR: Fireworks now supports an OpenAI-response API endpoint that allows you to connect our library of leading open models to your own tools and data using the open Model Context Protocol (MCP). The Unconnected LLM: A Walled Garden Large Language Models are incredibly powerful, but out of the box, they exist in a vacuum. They can't check your inventory, update a customer's order, or query your internal database. To make them truly useful for your business, they need to securely interact with your proprietary APIs, tools, and data sources. Historically, this required developers to build complex, brittle "glue code." You'd have to orchestrate a multi-step dance: prompt the model, parse its output to see if it wants to use a tool, make the API call yourself, and then feed the result back to the model. This process is slow, error-prone,…
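The multi-step "glue code" dance described above has a recognizable shape: prompt the model, check whether it asked for a tool, run the tool yourself, feed the result back, repeat. A sketch of that loop with a stubbed model (the message and tool-call shapes are illustrative assumptions; this is the pattern MCP support replaces, not the Fireworks API):

```python
# Sketch of the manual tool-calling loop the post describes as brittle glue
# code. The model is a stub; the loop shape is the point.
import json

def run_with_tools(model, tools, user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = model(messages)                  # one completion per hop
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]              # final answer, no tool needed
        result = tools[call["name"]](**call["arguments"])  # run the tool ourselves
        messages.append({"role": "tool", "content": json.dumps(result)})

# Stub model: asks for inventory once, then answers.
def stub_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "check_inventory",
                              "arguments": {"sku": "A1"}}}
    return {"content": "3 units of A1 in stock", "tool_call": None}

answer = run_with_tools(
    stub_model,
    {"check_inventory": lambda sku: {"sku": sku, "count": 3}},
    "How many A1?",
)
```

Every branch of this loop (parsing, dispatch, error handling, retries) is code you own and maintain per tool, which is exactly the burden a standard protocol like MCP is meant to remove.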
313d ago
6/16/2025 Build for Scale with Fireworks Virtual Cloud (GA)
Anyone who has run a production application at scale knows the impact that performance and reliability have on product success. For AI applications, the challenge is often to successfully operate a fleet of GPUs that handles scaled, globally distributed traffic, potentially in the midst of unprecedented growth. A few factors make managing bare-metal GPU deployments on your own difficult: Ultimately, these distract your team from what matters: building winning product experiences for users. That’s why today we’re excited to announce the GA of the Fireworks Virtual Cloud, a platform that abstracts away the complexity of managing GPU deployments, handling hardware failures, and scaling workloads across a global fleet. Launching with over 18 global regions across 8 cloud providers, including support for BYOC, Fireworks Virtual Cloud lets you build for scale from Day 1. To get started with Fireworks Virtual Cloud,…
315d ago
6/14/2025 3D FireOptimizer: Automating the Multi-Dimensional Tradeoffs in LLM Serving
Once you’ve launched your AI app, the next problem you often need to solve is maintaining quality while meeting the cost and latency bars needed to scale. However, there is no “one size fits all” approach to achieving optimal LLM performance – it depends heavily on your unique workload and the tradeoffs you make across the stack. With an explosion of choices across hardware, low-level optimizations, and model families, navigating this space on your own is more challenging than ever. At Fireworks, we help our customers find the sweet spot for their specific use case. We’ve previously published several blogs regarding tradeoffs in LLM serving and our approaches to them. Now, we’re excited to announce a new toolkit in our FireOptimizer tuning stack: 3D FireOptimizer. 3D FireOptimizer automatically searches through thousands of options to find configurations that achieve the optimal…
316d ago
6/13/2025 Introducing Supervised Fine Tuning V2
At Fireworks, we believe models and data are core assets for any company. If you're building a vertical product, owning both your data and your models is key to delivering a premium user experience and creating strong product differentiation. Data and models should form a self-improving loop: a better model powers a better product, a better product attracts more users, and more users generate more data to improve the model further. This is what we call the data flywheel. You likely already have strong GTM and engineering teams to accelerate growth. Fireworks can help you close the loop by turning your data into a high-quality, customized model, and potentially, an even better product. We're excited to unveil Supervised Fine Tuning V2, the next generation of our supervised fine-tuning service, designed to do just that. V2 is not just an upgrade…
319d ago
10/6/2025 Deep-Dive into LLM Fine-Tuning
Fine-tuning large language models (LLMs) has become one of the most critical levers for adapting general-purpose models to enterprise-grade applications. Models like Kimi K2, Qwen 3, and DeepSeek v3 provide remarkable generalization, but they are rarely optimal for domain-specific use cases that demand precision, compliance, and verifiable outputs. Fine-tuning bridges this gap, and understanding its mechanics and decision-making framework is essential for AI engineers building production systems. What Fine-Tuning Really Means At its core, fine-tuning is about updating the weights of a pre-trained model using a smaller, specialized dataset. This stands in contrast to pre-training, where models are trained from scratch on trillions of tokens. Pre-training gives the model broad linguistic and world knowledge, but it lacks the rigor to excel in narrow domains such as oncology, law, or financial compliance. Fine-tuning allows us to steer these general capabilities toward…
324d ago
5/6/2025 Qwen 3 on Fireworks AI: Controllable Chain-of-Thought and Tool Calling at Frontier Scale
TL;DR •Reasoning meets function calls. Qwen 3 now streams an explicit … trace and the exact JSON tool call in the same completion. •Turbo or stealth—your choice. Flip reasoning_effort="none" (or use the /think / /no_think tags) to trade transparency for raw throughput on the fly. •Mixture-of-Experts giant, pay-as-you-go. The 235B-parameter / 22B-active Qwen3-235B-A22B runs serverlessly on Fireworks. •Drop-in OpenAI compatibility. Use the Fireworks endpoint with the official OpenAI client; everything else stays the same. Why this release matters Until now, open-source LLMs forced a choice: show the chain of thought or call tools deterministically. Qwen 3’s new architecture does both in one pass, and keeps the reasoning block segregated so downstream code can ignore or audit it at will. Pair that with a 128-expert MoE that only activates eight experts (≈22B live…
331d ago
5/29/2025 Fireworks DevDay 2025 Wrapped
Yesterday (May 28th), we hosted our very first Fireworks DevDay, and what an incredible day it turned out to be. Set against the vibrant backdrop of San Francisco, we brought together some of the brightest minds in AI and hundreds of developers who are pushing the boundaries of what’s possible with open-source models. It wasn’t just an event; it was a celebration of progress, speed, and the collective ambition to reimagine what AI can do in production. Fireside Sessions Our keynote fireside chats were raw, insightful, and deeply inspiring. They weren’t just high-level vision talks; they were grounded in real engineering challenges and solutions. •Sarah Sachs (Head of AI Engineering, Notion) shared how her team is making fast, thoughtful decisions about model size and latency to create delightful user experiences in Notion AI. •Adarsh H. (Co-founder, CTO, Mercor) revealed how…
332d ago
5/28/2025 FireAttention V4: Industry-Leading Latency and Cost Efficiency with FP4
Today, we’re announcing that we've achieved industry-leading speeds of >250 tokens/second on NVIDIA B200 GPUs using our latest FireAttention V4 inference engine. FireAttention V4 achieves top-tier latency, throughput and cost efficiency, as measured by independent benchmarks, by leveraging FP4 (and specifically NVFP4) as the optimal precision for Blackwell architecture, just as FP16 was for Ampere and FP8 for Hopper. B200 deployments using FireAttention V4 with FP4 are now available to enterprise customers who need the best latency, throughput and cost-efficiency. Contact us or reach out to your Fireworks representative. NVIDIA Blackwell architecture is the first GPU generation to enable hardware-native micro-scaling support. While it has many micro-scaling modes, namely NVFP4, MXFP4, MXFP6, and MXFP8, NVFP4 is the most interesting option. The reason is that unlike MXFP6 and MXFP8 modes, it has 2x FLOPs throughput and requires ~1.5x-2x fewer memory reads.…
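The micro-scaling idea behind these formats, one shared scale per small block of 4-bit values, can be illustrated with a simplified sketch. Real NVFP4 uses FP4 (E2M1) elements with FP8 block scales; this integer version only shows the per-block-scale mechanism:

```python
# Simplified illustration of block-wise ("micro-scaling") quantization:
# each small block of values shares one scale, and each value is stored in
# a 4-bit signed range. This is a teaching sketch, not the NVFP4 format.
def quantize_blockwise(values, block=4):
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        # 4-bit signed integers cover -7..7 here; the scale adapts per block.
        scale = max(abs(v) for v in chunk) / 7 or 1.0
        q = [round(v / scale) for v in chunk]
        out.append((scale, q))
    return out

def dequantize_blockwise(blocks):
    return [scale * q for scale, qs in blocks for q in qs]

# A block of tiny values and a block of large values: each keeps reasonable
# relative precision because the scale is local, not global.
vals = [0.01, -0.02, 0.03, 0.012, 5.0, -3.0, 2.0, 1.0]
restored = dequantize_blockwise(quantize_blockwise(vals))
```

With a single global scale, the small-magnitude block would collapse to zero; per-block scales are what make 4-bit storage viable for weight and activation distributions with wide dynamic range.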
339d ago
Demo 5/21/2025 Building an open-source Browser Agent on Fireworks AI
Imagine an AI that doesn't just respond to your questions but can actively navigate the web for you - clicking buttons, filling forms, extracting information, and making decisions just like you would. That's the promise of AI agents with browser control capabilities, and it's becoming a reality with tools like Fireworks AI BrowserUse. In this technical deep dive, we'll explore how large language models (LLMs) can be given the ability to "see" web content and take actions in real-time. We'll examine the architecture that makes this possible and show why Fireworks AI's inference capabilities are particularly well-suited for this challenging task. Despite the push toward structured APIs, browsers remain the most universal interface to the web's vast information and services. Here's why building agents that can control browsers matters: This makes browser automation the most robust approach to web interaction,…
341d ago
5/19/2025 Agentic AI Systems
AI is evolving from passive responders into proactive agents that can perceive, reason, and act autonomously. We’re witnessing the rise of agentic systems - AI that goes beyond generating text responses to planning, executing, and learning across complex, multi-step tasks. Unlike traditional models, which respond to prompts or follow hardcoded scripts, agentic AI systems possess a sense of initiative. They can independently interpret goals, decide next actions, and iteratively refine their behavior over time. The result? AI that behaves less like a static program and more like a self-directed assistant or collaborator. This transformation isn’t theoretical. Today’s agents can book meetings, debug code, orchestrate workflows, and even collaborate with other agents - all with minimal human intervention. It’s a shift that promises not just increased productivity, but a fundamentally different way to build software. At the core, agentic AI systems…
341dAgents#agents
352d ago
8/5/2025 Introducing OpenAI gpt-oss (20b & 120b)
This is a deep-dive analysis of gpt-oss (20b & 120b), released by OpenAI on 5th Aug 2025. This blog explores its capabilities, technical architecture, benchmarks, and practical applications for developers. OpenAI is finally back to living up to its name of building “open models”. After GPT-2, this is the first set of open-source LLMs coming from OpenAI. OpenAI's new open-source models, gpt-oss-20b and gpt-oss-120b, are very strong reasoning models that excel at problem solving and tool calling. Both models support long context windows and adjustable reasoning levels. That makes them a great choice for agentic use cases. Try out the new OpenAI gpt-oss-120b & gpt-oss-20b on Fireworks AI! The following table is an evaluation across multiple benchmarks and reasoning levels for both gpt-oss-20b and gpt-oss-120b. The following table showcases the main capabilities evaluations, where gpt-oss models are compared…
358d ago
2/5/2025 DeepSeek R1 Just Got Eyes with Fireworks AI Document Inlining
A smart reasoning LLM is good, but a smart reasoning VLM is better! So let’s give DeepSeek R1 eyes. We’re excited to demonstrate how DeepSeek R1, a state-of-the-art reasoning model from DeepSeek AI, can now process and reason over both text and image inputs using the Fireworks AI Document Inlining feature. This capability extends DeepSeek R1’s powerful reasoning to multimodal analysis, opening new avenues for research and application in AI. A Quick Recap of DeepSeek R1 DeepSeek R1 has been making waves in the AI research community, consistently performing at the top of industry benchmarks and rivaling even some of the most prominent closed-source models. DeepSeek R1, developed by DeepSeek AI, is a state-of-the-art reasoning model with a massive 671 billion parameter (671B) configuration. It has demonstrated top-tier performance across various benchmarks, positioning itself as a leading open-source alternative in…
362d ago
4/28/2025 Optimizing Llama 4 Maverick on Fireworks AI
Llama 4 Maverick is Meta's first natively multimodal Mixture-of-Experts (MoE) model. This model processes both text and images, directing tokens through specialized expert blocks. Notably, it features a significantly expanded context window of 1 million tokens, roughly a 10x increase over comparable models. This advancement allows for keeping extensive code repositories, complete product specifications, or lengthy user conversations in its memory. Minutes after Meta published the weights, the model showed up in the Fireworks AI catalogue (accounts/fireworks/models/llama4-maverick-instruct-basic). Early adopters, including many of the edge-AI researchers who benchmarked the model, were already hitting the endpoint before most providers finished container builds. To enable superior performance of Llama 4 we leveraged multiple components of the Fireworks Platform: The flexibility of the platform enabled Fireworks AI to be the first public Llama 4 API. Independent testing by Artificial Analysis on April 27, 2025, demonstrates…
378d ago
12/4/2025 Turn Your LLM into a Calibrated Classifier for $2
Large language models aren’t just for free-form generation. A very common production use case is classification, where the model must choose among a small number of classes - and, crucially, return a probability (confidence) for each class. In this post, we present a practical approach to tuning and serving LLMs for classification tasks. The method builds on existing model training and inference infrastructure and is grounded in a theoretical analysis of how class probabilities naturally emerge during training. These probabilities can then be leveraged for downstream applications - such as ranking and event prediction - or used directly as confidence estimates. In many real-world setups, the training can be done with a surprisingly small budget - often as cheap as a couple of dollars in compute. Large language models are trained to predict the next token - but many real-world…
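The last step described above, turning per-class log-probabilities into a normalized confidence distribution, is a plain softmax. A sketch, with hypothetical class names and logprob values:

```python
import math

# Sketch: convert per-class log-probabilities (e.g. the logprobs an LLM
# assigns to each candidate class token) into a probability distribution
# that can serve as a confidence estimate.
def class_probabilities(logprobs):
    # Softmax with max-subtraction for numerical stability.
    m = max(logprobs.values())
    exps = {c: math.exp(lp - m) for c, lp in logprobs.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Hypothetical logprobs for a 3-way sentiment classifier.
probs = class_probabilities({"positive": -0.2, "neutral": -2.1, "negative": -3.5})
```

The resulting distribution sums to one, so the top class's probability can be read directly as a confidence score, or the full distribution used for ranking and event prediction.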
382d ago
8/4/2025 Announcing Eval Protocol
Can I swap one model for another? It is a simple question with no consistent method for answering confidently. Divergences of one form or another are hard to trap. So we set out to provide that method, and today we launched Eval Protocol, an OSS library and SDK for making model evaluations work like unit tests and run through CI/CD automation. Introducing Eval Protocol (EP) EP is an open protocol that standardizes how developers author evaluations for large language model (LLM) applications. EP provides a specification for writing evals and storing eval results that travel with developers from local model picking and prompt engineering, through production CI/CD, to automated fine-tuning and reinforcement learning for real-world use cases, from simple Markdown and JSON generation to complex customer service agents with tool calling. EP bridges the gap between quick wins and long-term customization. Developers can…
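The "evals as unit tests" idea reduces to assertions over model output. A minimal sketch with a stubbed model; the function names are illustrative, not the Eval Protocol API:

```python
import json

# Sketch of an eval written like a unit test: call the model, check the
# output against a pass/fail criterion. The model is a stub here; in a real
# setup it would be an API call, and the eval would run locally or in CI.
def model(prompt):
    # Stand-in for a real model call.
    return '{"city": "Paris", "country": "France"}'

def eval_json_output(prompt, required_keys):
    # Pass iff the model returns valid JSON containing the required keys.
    out = model(prompt)
    try:
        parsed = json.loads(out)
    except json.JSONDecodeError:
        return False
    return all(k in parsed for k in required_keys)

result = eval_json_output("Extract the city and country: ...", ["city", "country"])
```

Swapping the model behind the stub and re-running the same assertions is exactly the "can I swap one model for another?" question, answered mechanically.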
384d ago
6/4/2025 Building a High‑Quality Synthetic Data Pipeline for Supervised Fine‑Tuning
In modern AI workflows, access to large, well‑curated datasets is often the bottleneck for achieving production‑grade model performance. To address this, we developed an end‑to‑end system that automates synthetic data generation, quality control, and iterative fine‑tuning, delivering in hours what traditionally takes weeks. Below, we dive into the technical architecture, key components, and performance gains of this workflow. The pipeline is composed of five interlinked stages: At its core, the pipeline leverages large language models (LLMs) to orchestrate generation logic, apply dynamic constraints, and drive intelligent iteration through automated evaluation loops. Smart Defaults Once the YAML config is reviewed and optionally modified by the user, it is uploaded to the generation dashboard. This interface provides operational transparency into data generation progress, including: Generated data is streamed in JSON format, where each entry includes: Each row is stored with associated quality…
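The generate, score, filter loop at the heart of such a pipeline can be sketched with stand-ins for the LLM generator and the quality checks (all names, data, and checks here are hypothetical):

```python
# Sketch of a synthetic-data loop: generate candidate rows, score each for
# quality, and keep only rows that pass, repeating until the dataset reaches
# the target size. Generator and checks are toy stand-ins for LLM calls.
def generate_candidates(n):
    # Stand-in for LLM generation driven by a reviewed config.
    return [{"question": f"Q{i}", "answer": f"A{i}" if i % 3 else ""} for i in range(n)]

def quality_score(row):
    # Toy quality checks: question present and answer non-empty.
    return 1.0 if row["answer"] and row["question"] else 0.0

def build_dataset(target_size):
    kept = []
    while len(kept) < target_size:
        for row in generate_candidates(10):
            if quality_score(row) >= 1.0:
                kept.append(row)
                if len(kept) == target_size:
                    break
    return kept

dataset = build_dataset(5)
```

In a production pipeline the quality gate is itself often an LLM judge plus rule-based checks, but the control flow is the same: nothing enters the fine-tuning set without passing the filter.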
403d ago
3/18/2025 Fireworks AI Now Supports NVIDIA NIM Deployments for Blazing AI Inference
Today, we’re pleased to announce that Fireworks AI supports NVIDIA NIM microservices, part of the NVIDIA AI Enterprise software platform, making it faster and easier for enterprises to deploy AI models on Fireworks and innovate on their product experiences. Fireworks AI offers industry-leading speed, customization and cost efficiency for leading open source AI models like DeepSeek and Llama. NVIDIA NIM microservices offer a wide range of AI models for a range of modalities including embeddings, video, 3D and more. With today's announcement, you can load the NVIDIA NIM models, including the latest NVIDIA Llama Nemotron Reasoning models, on the Fireworks platform. Or you could run DeepSeek R1 or Llama 405B models on Fireworks to take full advantage of Fireworks' optimizations and platform offerings, while also running NeMo Guardrails NemoGuard models on Fireworks via NVIDIA NIM. Together, this allows enterprises to…
403d ago
3/18/2025 Faster, more efficient DeepSeek on the Fireworks AI Developer Cloud
At Fireworks, our mission is to empower developers with the premier toolchain using open models, delivering transparency, steerability, control, privacy, low latency, and cost efficiency. As agentic products continue gaining widespread adoption, the speed and efficiency of advanced AI models like DeepSeek R1 have become critical factors for product differentiation. Staying ahead, we continuously push the boundaries of performance and cost-efficiency through innovations like our specialized version of FireAttention and a distributed inference engine tailored specifically for DeepSeek’s unique MLA, MTP, and wide MoE architecture. Today, we're thrilled to announce exciting new options for deploying DeepSeek on Hopper GPUs, enhancing both speed and throughput. Expect even more advancements as we soon bring Blackwell GPUs into production. 1. Ultra-Fast DeepSeek R1 These enhancements build on our extensive developer platform capabilities: 👉 Secure Hosting: DeepSeek hosted securely in the US and EU,…
403dTutorial#inference#coding
410d ago
Product 11/3/2025 40X Faster, and Smarter Outputs: How Vercel Turbocharged their Code Fixing Model with Open Models, Speculative Decoding and Reinforcement Fine Tuning on Fireworks
Vercel, a leading platform provider for full-stack web applications, partnered with Fireworks to solve a critical challenge for their AI code generation tool, v0: maximizing both output quality and inference speed at scale. The solution involved optimizing v0’s auto-fixer solution for customized workloads. By implementing advanced techniques, including Reinforcement Fine-Tuning (RFT) and speculative decoding, Fireworks delivered a massive step-change in performance and quality for Vercel. The result is a platform capable of achieving a 93% error-free generation rate and a 40X improvement in end-to-end latency for v0 users, setting a new benchmark for developer-facing AI tooling. Vercel's v0 composite model family is a specialized AI architecture designed to generate high-quality, error-free code for building fast, full-stack web applications. It's a powerful tool for developers because it addresses the limitations of other models by combining retrieval-augmented generation (RAG) for specialized knowledge,…
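Speculative decoding itself is simple to sketch: a cheap draft model proposes several tokens and the target model verifies them, keeping the longest agreeing prefix. Both "models" below are deterministic stubs, and real systems verify the whole draft in one batched forward pass rather than token by token:

```python
# Toy sketch of speculative decoding with stub models.
def draft_model(prefix, k=4):
    # Cheap draft: proposes the next k tokens (often right, sometimes wrong).
    guess = "the quick brown fax jumps".split()
    return guess[len(prefix):len(prefix) + k]

def target_model(prefix):
    # Expensive target: the ground truth for the next token.
    truth = "the quick brown fox jumps".split()
    return truth[len(prefix)]

def speculative_step(prefix):
    proposed = draft_model(prefix)
    accepted = []
    for tok in proposed:
        correct = target_model(prefix + accepted)
        if tok == correct:
            accepted.append(tok)      # draft token verified, keep it
        else:
            accepted.append(correct)  # first mismatch: take target's token, stop
            break
    return accepted

tokens = []
while len(tokens) < 5:
    tokens.extend(speculative_step(tokens))
```

When the draft agrees with the target most of the time, each expensive verification pass yields several tokens instead of one, which is where the latency wins come from.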
435d ago
2/14/2025 Enabling Function Calling in DeepSeek v3: Bridging the Gap Between Text and Action
Large language models (LLMs) have revolutionized natural language processing by generating impressive text based on massive pretraining and strategic alignment with user preferences during post training. However, their inherent limitation is that—while they excel at generating human-like language—they lack the ability to access or update real-world information on demand. This is where function (or tool) calling comes into play. By enabling LLMs to invoke external functions or APIs, we can dynamically extend their capabilities, making them not only great conversationalists but also powerful, interactive agents. We are thrilled to announce that Fireworks AI API now supports function calling on top of the latest generation DeepSeek V3 model. Function calling refers to the process by which an LLM detects that a user request requires external data or action and then produces a structured output (typically in JSON) that specifies which function…
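The dispatch side of function calling is ordinary application code: parse the model's structured output and route it to a registered function. The tool name, arguments, and hard-coded model output below are illustrative:

```python
import json

# Sketch of the function-calling loop: the model emits a structured JSON
# tool call, and application code dispatches it to a real function.
def get_stock_price(symbol):
    # Stand-in for a real external data lookup.
    return {"symbol": symbol, "price": 123.45}

# Registry mapping tool names the model may emit to callables.
TOOLS = {"get_stock_price": get_stock_price}

# Hard-coded stand-in for the model's structured output.
model_output = '{"name": "get_stock_price", "arguments": {"symbol": "NVDA"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
```

The result would normally be fed back to the model as a tool message so it can compose a final natural-language answer, closing the loop between text and action.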
435dInfra#inference
437d ago
12/2/2025 Unlock Advanced Reasoning with NVIDIA Nemotron Nano 2 Models on Fireworks AI
We're excited to collaborate with NVIDIA to bring their groundbreaking NVIDIA Nemotron Nano 2 9B models to the Fireworks AI platform. NVIDIA Nemotron is a family of open models, datasets, and technologies that enable developers to build highly efficient and accurate specialized agents. Each Nemotron model is trained from scratch by NVIDIA and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. Nemotron Nano 2 is built on a hybrid Mamba-Transformer architecture that delivers expert-level reasoning with unprecedented efficiency. Deploying Nemotron models on Fireworks can unlock more powerful use cases for developers. In scientific research, Nemotron models act as an ideal lab partner that can easily process dense papers, explain complex concepts, and rapidly generate new hypotheses. For Search…
439d ago
10/2/2025 Production-Ready AI Agents with Optimized Inference with AWS AgentCore
Fireworks AI now integrates with AWS AgentCore, enabling developers to deploy AI agents with optimized inference on secure, serverless AWS infrastructure. Build locally, deploy globally with enterprise-grade security and automatic scaling. AWS AgentCore Runtime provides serverless infrastructure purpose-built for AI agents. It solves the operational complexity of deploying dynamic AI agents at scale by offering: This eliminates the need to manage containers, orchestration, or scaling infrastructure while maintaining enterprise security requirements. Fireworks AI delivers the fastest, highest quality inference engine for agentic workloads. We provide optimizations like adaptive caching and speculative decoding that are critical for multi-turn agent interactions. Combined with AgentCore's serverless deployment, you get: To demonstrate this integration, we built two cookbooks using AgentCore Runtime and AgentCore Code Interpreter. These agents can read files, generate Python code, run it, and interpret the results. The agents use state…
449d ago
1/31/2025 Distillation with Reasoning: Can DeepSeek R1 Teach Better Than Humans?
The recent release of DeepSeek R1 has taken the AI community by storm, offering performance on par with leading frontier models—such as OpenAI’s o1—at a fraction of the cost. Still, R1 can be expensive for use cases with high traffic or low latency requirements. DeepSeek R1’s strength lies in its explicit step-by-step reasoning. Before generating a final answer, it creates an internal “chain of thought” (CoT) to systematically reason through each problem. This process is a form of test-time computation, allowing the model to dynamically allocate more compute to complex problems. However, these extended reasoning sequences typically increase inference cost. Distillation is a method for transferring knowledge from a large, more powerful teacher model to a smaller, more cost-effective student model. According to the DeepSeek R1 paper, R1 is highly effective in this teacher role. Its detailed CoT sequences guide…
449dHardware
450d ago
1/30/2025 Mistral Small 3 Now Available on Fireworks: Faster, Lighter, and More Efficient
The latest open-weight model from Mistral—Mistral Small 3—is now live on Fireworks! Fireworks is excited to partner with Mistral to be an official launch partner for the model. With Apache 2.0 licensing, blazing-fast 150 TPS generation speeds, and a 32K context window, it’s a powerful choice for builders looking for low-latency, high-efficiency AI. Mistral Small 3 outperforms Llama 3.3 70B base on many pretraining benchmarks while being 3x faster on the same hardware. As the most knowledge-dense model in its class, it’s an excellent choice for: ✅ Conversational AI – Quick, accurate chatbot responses ✅ Function calling & automation – Low-latency execution for agentic workflows ✅ Fine-tuning & domain expertise – Ideal for specialized knowledge (legal, healthcare, finance) ✅ Local inference – Runs on an RTX 4090 or MacBook with 32GB RAM At Fireworks, we believe the future of AI…
453d ago
1/27/2025 Beyond Supervised Fine Tuning: How Reinforcement Learning Empowers AI with Minimal Labels
DeepSeek R1 and DeepSeek R1-Zero are all the rage right now. While DeepSeek R1 is likely a more suitable choice for production, DeepSeek R1-Zero as an exploratory model has also sparked significant interest in the community. For those of you who haven’t read the DeepSeek R1 technical report: DeepSeek R1-Zero is a model trained without any supervised training data using an algorithm called GRPO (Group Relative Policy Optimization), and it was able to self-evolve to solve complex problems through complex chains of thought. GRPO is a reinforcement learning algorithm that shares many similarities with PPO (Proximal Policy Optimization), the algorithm OpenAI famously used for RLHF in its original InstructGPT work. While PPO is effective, there are several downsides that make it harder to adopt in practice. To name a few: The GRPO algorithm originally introduced in the DeepSeekMath: Pushing…
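GRPO's central trick, replacing a learned value function with group-normalized rewards, fits in a few lines. The reward values below are hypothetical:

```python
# Core of GRPO's advantage computation, following the DeepSeekMath
# formulation: sample a group of completions for one prompt, score each,
# and use the group-normalized reward as the advantage. No critic network
# is needed, which is a key practical simplification over PPO.
def group_relative_advantages(rewards, eps=1e-8):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary rewards for 4 sampled completions of the same prompt
# (e.g. 1.0 = correct final answer, 0.0 = incorrect).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group's average get positive advantages and are reinforced; the rest are suppressed. Because the baseline is the group mean, no separate value model has to be trained or served.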
456d ago
1/24/2025 DeepSeek R1: All you need to know 🐳
The LLM research space is undergoing rapid evolution, with each new model pushing the boundaries of what machines can accomplish. DeepSeek R1, released on January 20, 2025, by DeepSeek, represents a significant leap in the realm of open-source reasoning models. With capabilities rivaling top proprietary solutions, DeepSeek R1 aims to make advanced reasoning, problem-solving, and real-time decision-making more accessible to researchers and developers across the globe. DeepSeek R1 is an open-source AI model that stands out for its reasoning-centric design. While many large language models excel at language understanding, DeepSeek R1 goes a step further by focusing on logical inference, mathematical problem-solving, and reflection capabilities—features that are often guarded behind closed-source APIs. Reasoning models are crucial for tasks where simple pattern recognition is insufficient. From complex mathematical proofs to high-stakes decision-making systems, the ability to reason about problems step-by-step can vastly…
456dTutorial#training
458d ago
1/22/2025 Real-time, performant code assistance: How Sourcegraph scaled with Fireworks AI
Introduction Sourcegraph helps enterprises industrialize software development with AI. They have empowered developers with enterprise-grade code search and analysis for over a decade at companies like Stripe, Uber, government organizations, and top US banks. With the increasing demand for AI-driven coding assistants, Sourcegraph’s AI is designed to enhance code understanding and developer productivity even further. To meet the high-performance demands of enterprise clients and the broader developer community, Sourcegraph sought a scalable, flexible, and cost-effective platform capable of integrating multiple Large Language Models (LLMs) while maintaining real-time performance. Company background - •Market position and audience: With 12 years in operation, Sourcegraph serves software engineering teams and leaders as the go-to provider for enterprise-grade code search across vast and complex repositories, accelerating how the biggest companies in the world build software. - •Core focus and adoption: Focused on delivering enterprise-grade code…
458dTutorial#inference#coding
470d ago
10/1/2025 Launching Fireworks for Startups Program!
Builders move fast. We’re here to help you move faster. Today, we’re launching Fireworks for Startups: a program that gives AI‑builders the platform, tools, and expert support to turn ambitious ideas into production‑grade products. Startups don’t have time to wrestle with infrastructure, the rapid pace of new model releases, or rising costs. The real challenge is building innovative products that stand out. Differentiation comes from fine-tuning and customization, but those shouldn’t be out of reach. Fireworks removes the infrastructure, customization, and cost hurdles so founders can focus on what matters most: developing unique products, getting to market faster, and scaling with confidence. The Fireworks for Startups program offers comprehensive support for startups building AI products. Our industry-leading serving and tuning platform makes it easy to scale AI with speed, efficiency, and reliability, while our world-class AI engineers give you…
472d ago
8/1/2025 Qwen3 Decoded: Choosing the Right Model For Your Task
With Thinking, Instruct, and Coder released simultaneously, confusion spiked. We stress-tested all three on your real workflows (same benchmarks as yesterday’s post) and found: •Qwen3 235B A22B Instruct beats o4 mini in reranking & classification (0.758 → 0.726 in live Fireworks traffic) •Qwen3 235B A22B Thinking 2507 dominates complex math (AIME25: 92.3 vs 81.5 – 11% jump) •Qwen3 Coder 480B A35B Instruct closes the gap with quality near GPT 4.1 (0.862 → 0.91 in live Fireworks traffic) Your surgical guide to deploying the right variant → TL;DR: Your Qwen3 Model Selection Guide Forget generic "better performance" claims. Here's exactly when to use which model based on verified testing: •Use Qwen3-Coder-480B-A35B-Instruct as a Full-Stack Web App Generator •Use Qwen3-235B-A22B-Thinking-2507 to solve advanced AIME math problems •Use Qwen3-235B-A22B-Instruct-2507 for Real-Time Customer Support Chat Response Generation Qwen3 Architecture and Benchmark Differences for Each…
472dTutorial#qwen#benchmark
472d ago
8/1/2025 Kimi K2: Deep Dive into model performance and use-cases
Kimi K2 excels in specialized real-world software engineering tasks, achieving a 65.8% score on the SWE-Bench Verified benchmark, surpassing GPT-4.1 (54.6%) and performing competitively with leading closed-source models like Claude Opus in long-range reasoning and autonomous tool-use scenarios. To fully leverage Kimi K2’s capabilities, experienced engineers are essential for customizing behaviors, integrating advanced tool chains, and establishing effective safety guardrails within its open architecture. Fireworks AI simplifies this process by expertly optimizing LLM deployments for speed, quality, and cost, letting you focus solely on building cutting-edge AI systems. At its core, Kimi K2 is a Mixture-of-Experts (MoE) Transformer, which means it has 384 specialized "experts": sub-models trained with targeted skills. When processing a token, only about 8 of these experts (roughly 32 billion parameters) activate, dynamically routing the input to the most relevant skills in real time. This means the…
472dResearch#benchmark
478d ago
2/1/2025 From text to task: Constrained generation for structured extraction in R1
What is constrained generation and why is it useful Constrained generation is a technique in natural language processing (NLP) where language models are guided to produce text that adheres to specific predefined rules or structures. This approach is particularly useful in applications requiring structured outputs, such as generating code, creating formatted documents, or producing data in formats like JSON. By enforcing constraints during the text generation process, models can ensure outputs that are not only coherent but also conform to the desired structure, enhancing both the utility and reliability of the generated content. What we’ll cover - •How constrained generation works - •Guiding model token selection - •Constrained decoding for structured outputs - •Reasoning models and structured extraction - •The role of constrained generation in reasoning models - •Fireworks' JSON mode for reasoning models - •Examples of constrained generation in…
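The token-masking mechanism can be sketched end to end with a toy grammar and a toy scoring function. A real implementation masks logits over the model's full vocabulary; this sketch just restricts a character-level template:

```python
# Minimal constrained-decoding sketch: at each step, discard tokens the
# grammar disallows, then pick the best remaining token. The "grammar"
# here only permits well-formed output of the form {"a": <digit>}.
def allowed_next(generated):
    template = ['{', '"', 'a', '"', ':', 'DIGIT', '}']
    pos = len(generated)
    if pos >= len(template):
        return set()  # output complete
    step = template[pos]
    return set("0123456789") if step == "DIGIT" else {step}

def constrained_decode(score_fn):
    out = []
    while True:
        allowed = allowed_next(out)
        if not allowed:
            break
        # Pick the highest-scoring token among those the grammar allows,
        # guaranteeing the final string is structurally valid.
        out.append(max(allowed, key=score_fn))
    return "".join(out)

# A toy "model" that scores digit tokens by their value; anything else 0.
result = constrained_decode(lambda tok: int(tok) if tok.isdigit() else 0)
```

Because every step intersects the model's preferences with the grammar's allowed set, the output is guaranteed to parse, regardless of what the unconstrained model would have produced.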
478dTutorial
493d ago
12/18/2024 DeepSeek V3 just got vision capabilities!
2024 has been the year of large multimodal models. From Stable LM 2 to DeepSeek v3, the growth in model capabilities along with benchmark metrics has been unprecedented. Not only did we see exceptionally good task-specific models; critical reasoning also became a core component of model capabilities. The last model release of 2024, DeepSeek v3, caught a ton of eyeballs because of how it blew up on the benchmark leaderboard. Not only did DeepSeek v3 beat GPT-4o in all the benchmarks, but it did exceptionally well on coding tasks. It has state-of-the-art benchmarks, achieving impressive scores on various tests, including 87.1% on MMLU and 87.5% on BBH. Source: DeepSeek V3 HuggingFace Benchmark DeepSeek v3 has positioned itself pretty strongly on the quadrant of being open-source, cost-efficient, and high-performance. They have used Multi-Token Prediction (MTP)…
526d ago
11/15/2024 Fireworks f1: A breakthrough in complex reasoning with Compound AI
At Fireworks, we believe the future of AI is shifting to compound AI systems that combine specialized models and tools to achieve better performance, reliability and control, compared to a single model. However building compound AI systems is difficult and time-consuming, right from selecting and tuning different components to orchestrating how they work together. So earlier this year, we set out to simplify the process of building compound AI, with the goal of making compound AI as easy to use as prompting a model. Today, we’re releasing a first step in that direction. f1 is a compound AI model specialized in complex reasoning, that interweaves multiple open models at the inference layer. Early testing has shown reasoning capabilities that match or exceed many closed frontier models as well as the best open models. f1 enables developers to access the power…
526dInfra#inference
530d ago
11/11/2024 How Upwork and Fireworks deliver faster, smarter proposals for freelancers
Upwork, the world's largest work marketplace, connects businesses with independent talent across over 10,000 skills in categories like website & app development, creative & design, data science & analytics, among others. From startups to Fortune 100 firms, Upwork enables companies to collaborate with freelancers, boosting productivity and driving innovation through flexible, scalable workforce solutions. Founded in 2015, Upwork has grown to become a leader in the work economy, facilitating over $3.8 billion in talent earnings as of last year. With a global reach and a diverse array of skill categories, Upwork continually innovates to meet the evolving needs of both businesses and freelancers in the digital age. The challenge As Upwork scaled its AI initiatives and launched Uma, Upwork’s Mindful AI, they faced a critical challenge: how to deliver high-quality, instantaneous AI assistance in a complex, multi-model environment connecting millions…
534d ago
7/11/2024 Fireworks AI Raises $52M Series B to Lead Industry Shift to Compound AI Systems
We’re thrilled to announce our $52M Series B funding round led by Sequoia Capital, raising our valuation to $552M. Other investors in this round include NVIDIA, AMD, and MongoDB Ventures. Previous investors include Benchmark, Databricks Ventures, former Snowflake CEO Frank Slootman, former Meta COO Sheryl Sandberg, Airtable CEO Howie Liu, Scale AI CEO Alexandr Wang, as well as executives from LinkedIn, Confluent, Meta, and 1Password. This new funding round brings the total capital raised by Fireworks AI to $77M. This investment will help us drive the industry shift to compound AI systems, expand our team, and enhance our platform, enabling developers to quickly move AI applications from prototype to production. Sequoia General Partner Sonya Huang shared with me: “Fireworks AI is perfectly positioned to lead this industry shift. Their team's expertise in building high-performance inference stacks and innovative approach to…
534dInfra#inference
550d ago
10/22/2024 FLUX.1 on Fireworks: Fast, frugal, and flexible
In partnership with Black Forest Labs, Fireworks is excited to announce commercially-usable FLUX.1 [dev] and FLUX.1 [schnell] models on Fireworks: Fireworks has been committed to fast, production-ready image generation - through milestones like an exclusive launch of Stable Diffusion 3 and serving SDXL and Playground v2.5 with < 1 second generation times. Today, Fireworks is partnering with Black Forest Labs, the original creators of the Stable Diffusion image generation models, to offer FLUX.1 [dev] and FLUX.1 [schnell]. Fireworks’ platform is designed to bring AI applications from prototype to production usage. By default, FLUX.1 models used outside of Fireworks have restrictions on commercial usage, but Fireworks and Black Forest Labs’ partnership enables commercial usage of both models on the Fireworks platform. Fireworks offers: The FLUX models are two of the highest-quality image models available and Fireworks offers the most customizable, fastest,…
557d ago
10/15/2024 FireAttention V3: Enabling AMD as a viable alternative for GPU inference
This post is the continuation of our FireAttention blog series: FireAttention V1 and FireAttention V2. This time we are going to focus on different GPU hardware, namely the AMD MI300 GPU. While spec-wise it looks quite superior to the NVIDIA H100 GPU, we never know how it’s going to perform in real-world LLM inference settings until we run benchmarks that represent practical LLM usage. Fireworks has been using AMD MI300 hardware in production since the launch of LLaMA 405b. In this post we are going to go over the work which made it happen. FireAttention V3 is an AMD-specific implementation for Fireworks LLM. When measured on 8 MI300 GPUs vs other leading LLM implementations (NIM Containers on H100 and AMD vLLM on MI300) it achieves 1.4x improvement for the average RPS @ 8 secs metric for LLaMA 8B model and…
557d · Hardware · #inference
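The "average RPS @ 8 secs" figure above is a throughput-under-a-latency-budget metric. As a hedged sketch (the function name and the interpretation as "highest offered load whose mean latency stays under the cutoff" are our assumptions, not the blog's definition), such a metric could be computed from a load sweep like this:

```python
# Hypothetical sketch: given load-sweep measurements of (offered RPS, observed
# mean latency in seconds), report the highest throughput that keeps latency
# under a cutoff -- one plausible reading of an "RPS @ 8 secs" style metric.

def rps_at_latency_cutoff(measurements, cutoff_s=8.0):
    """measurements: list of (rps, mean_latency_seconds) from a load sweep."""
    within_budget = [rps for rps, latency in measurements if latency <= cutoff_s]
    return max(within_budget, default=0.0)

# Illustrative (made-up) sweep: latency grows as offered load increases.
sweep = [(1.0, 1.2), (2.0, 2.5), (4.0, 5.9), (6.0, 7.8), (8.0, 11.4)]
print(rps_at_latency_cutoff(sweep))  # 6.0
```

Comparing two stacks then reduces to comparing this number at the same cutoff on the same model and workload.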
558d ago
10/14/2024 Three projects, one platform: A developer's winning streak with Fireworks AI
When it comes to building with Fireworks AI, few developers can match Nehil Jain's track record. His latest triumph – securing second place at the E2B x Fireworks AI hackathon alongside Selvam Palanimalai – marks his third consecutive success using the platform. From automating release notes with LazyPMs to matching hackathon teams with KinConnect, and now ensuring documentation reliability with ProoferX, Nehil has consistently demonstrated the power and versatility of Fireworks AI. "The speed I get using Fireworks endpoints for Llama models is one of the key drivers for successful outcomes," explains Nehil, whose deep understanding of the platform has been crucial to his winning streak. "A combination of Llama for intelligence and Firefunction for tool calling, along with structured outputs, lets me build reliable AI pipelines." His latest project, ProoferX, developed with technical architecture expert Selvam Palanimalai, tackles one…
577d ago
9/25/2024 Partnering with Meta: Bringing Llama 3.2 to Fireworks for Fine-Tuning and Inference
We are excited to announce support for the newest additions to the Llama collection from Meta. With the addition of Llama 3.2, developers gain access to new tools that enable the creation of sophisticated multi-component AI systems that combine models, modalities, and external tools to deliver advanced real-world AI solutions. Llama 3.2: Seeing The World More Clearly (And Quickly) The release of Llama 3.2 1B, Llama 3.2 3B, Llama 3.2 11B Vision, and Llama 3.2 90B Vision models brings a range of text-only and multimodal models designed to enhance modular AI workflows. These models provide deep customization, allowing developers to tailor solutions and accelerate specific tasks in compound AI systems. Get started today on Fireworks: •Llama 3.2 1B (text-only): Ideal for retrieval and summarization tasks such as personal information management, multilingual knowledge retrieval, and rewriting tasks. •Llama 3.2 3B (text-only):…
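Fireworks serves models through an OpenAI-compatible chat-completions interface, so getting started with a Llama 3.2 model amounts to sending a standard JSON request body. The sketch below only assembles that body (the endpoint URL and the exact model id shown are illustrative assumptions, not confirmed identifiers):

```python
# A minimal sketch of an OpenAI-compatible chat-completions request for a
# Llama 3.2 model. The URL and model id are assumptions for illustration;
# check the Fireworks model catalog for the real identifiers.
import json

FIREWORKS_URL = "https://api.fireworks.ai/inference/v1/chat/completions"  # assumed

def build_chat_payload(model, user_message, max_tokens=256):
    """Assemble the JSON body for an OpenAI-compatible chat request."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_chat_payload(
    "accounts/fireworks/models/llama-v3p2-3b-instruct",  # hypothetical id
    "Summarize this support ticket in one sentence.",
)
print(json.dumps(payload, indent=2))
```

POSTing this body to the endpoint with an API key in the `Authorization` header would complete the call; the text-only 1B/3B models and the 11B/90B Vision models are addressed the same way, differing only in the model id (and, for vision, image content parts in `messages`).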
577d ago
9/25/2024 How Enterprises are using Multimodal Models in production with Fireworks
Enterprises process large amounts of unstructured data, including scanned text, tables, charts, and images. Multimodal models allow enterprises to extract information and insights from their data faster and more easily than ever, without needing large in-house AI teams or managing complex ML infrastructure. Read on to learn how enterprises are deploying multimodal models with Fireworks in production use cases! 👉 Try Llama 3.2 11B, Llama 3.2 90B and other multimodal models on Fireworks. We’re also launching the ability to fine-tune multimodal models on Fireworks very soon! Fireworks worked with major healthcare and insurance companies to enable efficient, real-time processing and analysis of vast amounts of medical and insurance records to extract key data points and insights. The sheer volume and complexity of these records made it difficult to classify and extract data quickly and accurately. Fireworks helped customers fine-tune and…
584d ago
9/18/2024 Multi-LoRA: Personalize AI at scale and deliver the best experience for each customer and use case, with 100x cost-efficiency
At Fireworks, we know how important it is for you to deliver the best product experiences to your users. Today, we’re excited to spotlight Multi-LoRA, a FireOptimizer capability that customers have used to personalize AI at scale and deliver the best experience for each customer and use case. Why it matters: Personalized experiences are critical to driving greater usage, retention and customer satisfaction for your product. Before Multi-LoRA, if you had many users, user segments or use cases to personalize for, deploying hundreds of fine-tuned models on separate GPUs would be prohibitively expensive. With Multi-LoRA, you can now deliver personalized experiences across thousands of users and use cases, without scaling your costs! 🚀 Multi-LoRA benefits: Multi-LoRA is part of FireOptimizer, our adaptation engine designed to customize and enhance AI model performance for your unique use cases and workload. FireOptimizer capabilities…
584d · Model · #fine-tuning
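The cost-efficiency claim rests on a well-known serving pattern: one shared base model in GPU memory, with many small LoRA adapters swapped in per request instead of one full model per customer. A hedged sketch of that routing idea (class and method names here are illustrative, not the Fireworks API):

```python
# Toy sketch of the serving pattern behind Multi-LoRA: a single base model is
# shared across customers, and each request is routed to a small per-customer
# adapter by id. No real inference happens here; we only model the routing.

class MultiLoRAServer:
    def __init__(self, base_model_name):
        self.base_model_name = base_model_name
        self.adapters = {}  # adapter_id -> low-rank weight deltas

    def register_adapter(self, adapter_id, weights):
        self.adapters[adapter_id] = weights

    def generate(self, adapter_id, prompt):
        # In a real server the adapter's low-rank deltas would be applied to
        # the shared base weights inside the forward pass; here we just
        # record which adapter served which prompt.
        if adapter_id not in self.adapters:
            raise KeyError(f"unknown adapter: {adapter_id}")
        return f"[{self.base_model_name}+{adapter_id}] response to: {prompt}"

server = MultiLoRAServer("llama-base")
server.register_adapter("customer-a", weights={"rank": 8})
server.register_adapter("customer-b", weights={"rank": 16})
print(server.generate("customer-a", "Draft a welcome email"))
```

Because adapters are orders of magnitude smaller than the base model, hundreds can share one deployment, which is where the "100x cost-efficiency" framing comes from.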
590d ago
12/9/2024 20x faster Whisper than OpenAI - Fireworks audio transcribes 1 hour in 4 seconds
Today, Fireworks is thrilled to announce the beta release of our speech-to-text APIs that support the Whisper v3-large models. Use it free for the next 2 weeks! Key features include: Why it matters: Audio transcription and translation use cases are exploding in importance. Fireworks audio excels at real-world, production use cases. Fireworks’ speed enables unmatched user experiences and our complete feature slate makes it easy to get the best quality and production-readiness. The compound AI audio opportunity At Fireworks, we believe we’re entering a new era of multimodal, audio-driven AI. Products like NotebookLM (and open variants built on Fireworks) demonstrate how audio and text AI can combine to create magical user experiences. Fireworks customers like Cresta are innovating to create audio-first assistants while other customers create audio-based language learning assistants, tutors or call summarizers. These combined audio and text experiences…
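The headline numbers are easy to sanity-check as a real-time factor: 1 hour of audio in 4 seconds is 3600 / 4 = 900x faster than real time. A trivial sketch of that arithmetic:

```python
# Back-of-envelope check of the headline claim: transcribing 1 hour of audio
# in 4 seconds corresponds to a 900x real-time factor.

def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds

print(real_time_factor(3600, 4))  # 900.0
```

The "20x faster than OpenAI" comparison in the title is a separate claim about relative endpoint throughput, not derivable from this arithmetic alone.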
603d ago
8/30/2024 FireOptimizer: Customizing latency and quality for your production inference workload
At Fireworks, we've always believed that off-the-shelf models need to be adapted to meet production-grade performance. Today, we’re excited to introduce FireOptimizer, our adaptation engine designed to customize and enhance AI model performance for your unique use cases and workload. We have launched a new FireOptimizer feature: adaptive speculative execution, which delivers up to 3x latency improvements by automatically tailoring speculative execution to your specific data and needs. Why It Matters: In today’s world, every millisecond counts. Whether you’re powering real-time customer interactions, processing large-scale data for intelligent search, or using AI to generate code, FireOptimizer simplifies the complex tuning work for optimizing latency and quality, and ensures your models are not just fast, but customized to perform at their best for your unique scenario. The Benefits: Many developers are surprised by the extent to which results can vary when serving the…
603d · Infra · #inference
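Speculative execution earns its latency wins by letting a cheap draft model propose several tokens that the full model verifies in one pass; every accepted draft token saves a full-model decoding step. A toy sketch of one verify step (both "models" below are stand-in functions, not real inference, and greedy agreement stands in for the probabilistic acceptance rule real systems use):

```python
# Toy sketch of one speculative-decoding verify step: accept the longest
# prefix of draft tokens the target model agrees with, then take the target's
# own token at the first disagreement (so each pass yields >= 1 token).

def speculative_step(draft_tokens, verify_token):
    """draft_tokens: tokens proposed by the draft model for the next positions.
    verify_token: callable giving the target model's token at each position.
    Returns (accepted_tokens, accepted_count)."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if verify_token(i) == tok:
            accepted.append(tok)
        else:
            accepted.append(verify_token(i))  # target's correction
            break
    return accepted, len(accepted)

# Stand-in target "model": agrees with the first two draft tokens, then diverges.
target = ["the", "fast", "model"]
accepted, n = speculative_step(["the", "fast", "mode"], lambda i: target[i])
print(accepted, n)  # ['the', 'fast', 'model'] 3
```

The "adaptive" part of the feature is choosing the draft strategy from your actual traffic so the acceptance rate, and hence the speedup, stays high on your workload.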