$ timeahead_
★ TOP STORY · [GB] · Infra · 16d ago

Canopy Labs’ Orpheus TTS is live on GroqCloud

In December, we announced support for Canopy Labs’ Orpheus text-to-speech (TTS) on GroqCloud, with two model variants built for real-time, high-quality voices:
- English TTS: canopylabs/orpheus-v1-english (with vocal directions)
- Saudi Arabic (dialect) TTS: canopylabs/orpheus-arabic-saudi (authentic pronunciation + regional nuance)
Today, we’re excited to announce a new release of the Saudi Arabic Orpheus TTS model on GroqCloud (canopylabs/orpheus-arabic-saudi). This release brings overall model improvements, including reduced hallucinations, more natural and expressive speech, and more accurate handling of numbers and symbols. It also introduces two new Saudi Arabic voices designed to sound more natural, culturally grounded, and production-ready.
- Abdullah — A professional, calm, and conversational male voice, ideal for assistants, enterprise workflows, and general voice interfaces.
- Aisha — A professional, clear, and approachable female voice, especially effective for customer support and…

Groq Blog
[GB] Groq Blog · 23 articles
68d ago
GroqCloud: Expanding to Meet Demand
Demand for high-performance AI inference is accelerating globally, driven by real-time applications moving from experimentation into production. As this shift takes hold, infrastructure that delivers predictable performance, low latency, and efficient scale is becoming increasingly critical. At Groq, our architecture, roadmap, and customer commitments remain Groq-led. At the same time, GroqCloud adoption continues to support our planned global infrastructure expansion, enabling reliable inference deployments for developers and enterprises wherever they operate.
Scaling GroqCloud for Production Workloads
As interest in inference-optimized infrastructure continues to rise, GroqCloud has seen record levels of developers—now exceeding 3.5 million—along with sustained increases in production traffic. Teams across industries are using GroqCloud to power real-time applications where consistency, determinism, and cost efficiency are non-negotiable. To support this momentum, Groq is continuing to scale GroqCloud’s global availability.
New UK Data Center Expands…
130d ago
Advancing the American AI Stack
Introduction
Power has always flowed from the control of the world's essential resources. Once it was steel, then oil, then data. Today, it is AI compute, and specifically, the ability to run AI systems efficiently at global scale. Whoever controls AI compute will shape the century ahead. Compute is fast becoming the foundation of global economic growth. In the United States, investment in AI infrastructure—from data centers to semiconductors and energy systems—is already moving the needle: J.P. Morgan estimates that data-center spending alone could boost U.S. GDP by up to 20 basis points over the next two years (footnote 1). According to The Economist, investments tied to AI now account for 40 percent of America's GDP growth over the past year, equal to the amount contributed by consumer spending growth. That statistic would be staggering regardless…
145d ago
Groq Recognized in 2025 Gartner® Cool Vendor in AI Infrastructure report
The next era of AI is here, one defined by fast, intelligent inference that scales as far as the world needs. Groq has been recognized as a 2025 Gartner Cool Vendor in AI Infrastructure. We believe this demonstrates the unique advantages LPUs deliver for real-time AI systems compared to traditional GPU architectures. The Gartner Cool Vendors report highlights innovative infrastructure vendors that enable heads of infrastructure & operations to deploy AI more rapidly, optimize costs, and mitigate risks, resulting in more effective and future-ready AI initiatives. More than 2.5M developers choose Groq for performance that’s up to 5x faster, at lower cost, than GPU-based alternatives. This capability stems from the Groq LPU, a chip purpose-built for low-latency inference, which we deliver to developers worldwide with GroqCloud. Compared to GPU-based…
145d · Hardware · #inference
151d ago
Introducing MCP Connectors in Beta on GroqCloud
Zero-setup tool use for Google Workspace: faster, easier, and more secure than ever.
Today we’re excited to announce MCP Connectors in Beta on GroqCloud. MCP Connectors are Groq-maintained wrappers for external services (starting with Google Workspace: Gmail, Drive, and Calendar). This release builds on top of GroqCloud’s existing remote MCP support, enabling your AI agents to interact with Google Workspace tools via the Responses API, without having to manage your own MCP server. Built for developers who want powerful AI agents without operational overhead, MCP Connectors deliver drop-in compatibility, zero infrastructure to manage, and Groq’s speed on our secure inference infrastructure.
Why MCP Connectors Matter
The MCP (Model Context Protocol) approach unlocks agentic AI workflows: models interacting with external tools, databases, and services to extend beyond pure text generation. Some external services provide…
151d · Infra · #inference
178d ago
Day Zero Support for OpenAI Open Safety Model
Fast and Affordable AI Inference for the World’s Latest Open Safety Model
We’re excited to announce the availability of GPT-OSS-Safeguard-20B on GroqCloud, providing day zero, on‑demand access to OpenAI’s newest open‑source model running at over 1000 t/s. This first‑of‑its‑kind open‑weight reasoning model is purpose-built for safety‑classification workloads and lets you bring your own policy to production in minutes.
What is GPT‑OSS‑Safeguard‑20B?
- Fine‑tuned from OpenAI’s GPT‑OSS-20B
- Built for safety use cases
- Safety‑first reasoning – trained to follow explicit, user‑provided policies and to explain its decisions.
In short, GPT-OSS-Safeguard-20B provides a reasoned classifier instead of a raw score, making debugging and compliance far easier.
Core Features for Trust & Safety Teams Enabled on GroqCloud
- Bring Your Own Policy: Load any taxonomy, definition set, or threshold you own. The model will…
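The bring-your-own-policy flow can be sketched as a chat-style request: your policy rides in the system message, and the content to classify rides in the user message. This is a minimal illustration, not Groq's documented API shape; the model id, policy text, and response handling are assumptions here.

```python
# Minimal sketch of a bring-your-own-policy moderation request in an
# OpenAI-compatible chat format. The model id and policy are illustrative
# assumptions; sending the request requires a GroqCloud API key and client.
policy = """Classify the user content as ALLOW or BLOCK.
BLOCK if it requests instructions for illegal activity; otherwise ALLOW.
Explain your decision briefly."""

def build_safeguard_request(content: str) -> dict:
    return {
        "model": "openai/gpt-oss-safeguard-20b",  # assumed model id
        "messages": [
            {"role": "system", "content": policy},  # your own taxonomy/policy
            {"role": "user", "content": content},   # the text to classify
        ],
        "temperature": 0,  # deterministic classification
    }

req = build_safeguard_request("How do I reset my router password?")
```

Because the policy is just a system message, swapping taxonomies or thresholds is a one-string change rather than a retraining job.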
185d ago
LLMs Inside the Product: A Practical Field Guide
Building with LLMs has taught me one clear lesson: the best AI feature is often invisible. When it works, the user doesn’t stop to think “that was AI.” They just click a button, get an answer quickly, and move on with their task. When it doesn’t work, you notice right away: the spinner takes too long, or the answer sounds confident but is not true. I’ve hit both of these walls many times. And each time, the fix was less about “smarter AI” and more about careful engineering choices. Use only the context you need. Ask for structured output. Keep randomness low when accuracy is important. Allow the system to say “I do not know.” This guide is not about big research ideas. It’s about practical steps any engineer can follow to…
185d · Tutorial · #open-source
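The guide's checklist (trim context, ask for structured output, keep randomness low, allow "I do not know") can be sketched as a request builder for an OpenAI-compatible chat endpoint. The model name, JSON schema, and snippet limit are illustrative assumptions, not a specific product API:

```python
# A sketch of the field guide's advice as a single request builder.
# All names and values here are illustrative assumptions.
def build_request(question: str, context_snippets: list[str]) -> dict:
    context = "\n".join(context_snippets[:3])  # use only the context you need
    return {
        "model": "llama-3.3-70b-versatile",  # any chat model; id is illustrative
        "temperature": 0.1,                  # low randomness when accuracy matters
        "response_format": {"type": "json_object"},  # ask for structured output
        "messages": [
            {"role": "system", "content": (
                "Answer from the context only. Reply as JSON: "
                '{"answer": str, "confident": bool}. '
                'If the context is insufficient, set "confident": false.'
            )},  # the "confident" flag is the escape hatch for "I do not know"
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }

req = build_request("What port does the service use?", ["s1", "s2", "s3", "s4"])
```

The point is that every one of these reliability levers is plain request plumbing, not model research.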
191d ago
GPT‑OSS Improvements: Prompt Caching & Lower Pricing
At Groq, we’re relentless about fueling developers with the best price‑performance for AI inference. Today we’re starting to roll out two updates for GPT-OSS models that make building at scale faster, cheaper, and simpler.
New, Lower Prices for GPT‑OSS Models
We’ve reduced the price for gpt-oss models on GroqCloud to ensure all developers can ignite their applications with increased cost efficiency. These new prices are effective today and will apply retroactively to all unpaid invoices for the month of October 2025.
Prompt Caching on GPT-OSS Models
We’re rolling out prompt caching on our GPT-OSS models. Last month we quietly rolled out prompt caching on GPT-OSS-20B, and we’ll be rolling it out on GPT-OSS-120B over the next few weeks. What this means for developers using these models:
- Up to 50% discount on…
214d ago
Introducing Remote MCP Support in Beta on GroqCloud
Connect to any tool. Share context seamlessly. All OpenAI compatible. Run faster at lower cost.
Today we’re announcing that remote Model Context Protocol (MCP) server integration is available in Beta on GroqCloud, unlocking faster, lower-cost AI applications with tool capabilities through Anthropic’s open MCP standard. Developers can now connect any remote MCP server of their choice to models hosted on GroqCloud, allowing models to interact with external tools (GitHub, browsers, databases, and more) via the OpenAI-compatible Groq Responses API. Because our implementation is compatible with both the OpenAI Responses API and the OpenAI remote MCP specification, developers already running on OpenAI can switch to Groq with zero code changes and immediately benefit from Groq’s speed and predictable costs.
Why Remote MCP Matters
The Model Context Protocol is an open standard for connecting…
214d · Infra · #inference
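The zero-code-change claim can be illustrated by the shape of a Responses API request that attaches a remote MCP server as a tool. Field names here follow the OpenAI remote MCP style the post says Groq mirrors, but the model id, server label, and server URL are placeholders, not documented values:

```python
# Sketch of an OpenAI-compatible Responses API request with a remote
# MCP server attached as a tool. All ids and URLs are placeholders.
def build_mcp_request(prompt: str) -> dict:
    return {
        "model": "openai/gpt-oss-120b",  # illustrative model id
        "input": prompt,
        "tools": [{
            "type": "mcp",                          # remote MCP tool type
            "server_label": "github",               # your label for the server
            "server_url": "https://example.com/mcp",  # your remote MCP server
        }],
    }

req = build_mcp_request("List my open pull requests.")
```

Since the payload matches the OpenAI shape, pointing an existing OpenAI client at Groq's base URL is, in principle, the only migration step.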
233d ago
Introducing the Next Generation of Compound on GroqCloud
Our agentic AI system, rolling out in general availability
Today we’re announcing that Compound Beta, Groq’s first agent and compound AI system, is moving to general availability as Compound on GroqCloud. With Compound, developers can integrate agentic AI that can conduct research, execute code, control browsers, and navigate the web on their behalf. Compound from Groq uniquely delivers leading quality and low latency at low cost. Now in general availability, developers can expect production-grade stability and increased rate limits. Since the beta launch, we’ve seen more than 100k developers use Compound, generating more than 5M requests, and thousands of active customers. With their feedback, we have improved Compound, shifting the vision from tool-enabled models towards an agentic operator.
New Version Available
We’re also releasing a new version that is our smartest,…
233d ago
Introducing Kimi K2‑0905 on GroqCloud
Moonshot AI’s cutting‑edge model, moonshotai/Kimi-K2-Instruct-0905, is now live on GroqCloud. This integration brings day zero support for the latest frontier open model alongside production‑grade speed, low latency, and predictable cost, empowering developers to take agentic coding to the next level.
Key Features of Kimi K2‑0905 on GroqCloud
- Full 256k Context Window: The largest context window of any model on GroqCloud to date.
- Prompt Caching: Up to 50% cost savings on cached tokens and dramatically faster response times. When paired with the 256k context window, this is a massive unlock for agentic coding applications, where a large amount of context is shared between queries.
- Leading Price‑to‑Performance: 200+ T/s at a blended price of $1.50 / M tokens ($1.00 / M input tokens; $3.00 / M output tokens), helping to provide top‑tier performance…
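The quoted blended rate can be sanity-checked with quick arithmetic: a $1.50/M blend of $1.00/M input and $3.00/M output implies a 3:1 input-to-output token mix, which is an assumption here since the post does not state the ratio:

```python
# Back-of-the-envelope check of the blended price under an assumed
# 3:1 input:output token mix (the post does not state the ratio).
input_rate, output_rate = 1.00, 3.00                 # $ per million tokens
input_tokens, output_tokens = 3_000_000, 1_000_000   # assumed 3:1 mix
cost = input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate
blended = cost / ((input_tokens + output_tokens) / 1e6)
print(blended)  # → 1.5
```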
248d ago
Introducing Prompt Caching on GroqCloud
Fast, Low Cost, and Seamless AI Inference for Repetitive Workloads
Prompt caching is rolling out on GroqCloud, starting with Kimi K2-Instruct. It works by reusing computations for prompts that start with the same prefix, so developers only pay full price for the differences. The result is a 50% cost savings on cached tokens and dramatically faster response times, with no code changes required. Ideal for chatbots, retrieval augmented generation, code assistants, and any workflow with stable, reusable prompt components, prompt caching works automatically on every API request, making your AI workflows faster and cheaper right out of the box.
Why Prompt Caching Matters
Instant Speed‑Ups - Reduced latency for any request that shares an identical token prefix with a recent request.
50% Token‑Cost Savings - All input tokens in the identical prefix get a 50%…
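The billing effect described above (tokens in a previously seen prefix are billed at half price) can be sketched as a small cost model. The token counts and rate below are made-up illustration values, not GroqCloud prices:

```python
# Rough cost model for prefix-based prompt caching: input tokens that
# fall inside a cached prefix are billed at a 50% discount, the rest at
# full price. All numbers here are illustrative, not actual prices.
def input_cost(total_tokens: int, cached_prefix_tokens: int,
               rate_per_m: float, cache_discount: float = 0.5) -> float:
    fresh = total_tokens - cached_prefix_tokens
    return (fresh * rate_per_m
            + cached_prefix_tokens * rate_per_m * (1 - cache_discount)) / 1e6

# An 8,000-token prompt whose 6,000-token system prefix is cached, at $1.00/M:
print(round(input_cost(8_000, 6_000, 1.00), 4))  # → 0.005
```

This is also why the post recommends stable, reusable prompt components: the more of each request that lands in the shared prefix, the bigger the discount.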
263d ago
Day Zero Support for OpenAI Open Models
Fast and Affordable AI Inference for the World’s Most Popular Models
We're excited to announce that GroqCloud now supports the much-anticipated OpenAI open models, openai/gpt-oss-120B and openai/gpt-oss-20B! This launch brings day-zero support for the latest open models, empowering developers worldwide to build innovative AI applications with unprecedented speed, scale, and production reliability.
Full Model Capabilities
To get the most out of OpenAI's open models, extended context and tools like code execution and browser search are essential. Groq's platform delivers these capabilities from day zero, with full support for 128K token context length and built-in tools such as code execution and browser search. This enables developers to build complex workflows, provide accurate and relevant information, and leverage real-time reasoning. In conjunction with the release of these models, we have also added a…
267d ago
Inside the LPU: Deconstructing Groq’s Speed
Moonshot’s Kimi K2 recently launched in preview on GroqCloud and developers keep asking us: how is Groq running a 1-trillion-parameter model this fast? Legacy hardware forces a choice: faster inference with quality degradation, or accurate inference with unacceptable latency. This tradeoff exists because GPU architectures optimize for training workloads. The LPU, purpose-built hardware for inference, preserves quality while eliminating the architectural bottlenecks that create latency in the first place.
Accuracy Without Tradeoffs: TruePoint Numerics
Traditional accelerators achieve speed through aggressive quantization, forcing models into INT8 or lower-precision numerics that introduce cumulative errors throughout the computation pipeline and lead to loss of quality. We use TruePoint numerics, which changes this equation. TruePoint is an approach that reduces precision only in areas that do not reduce accuracy. Coupled with our LPU architecture, this allows for the preservation of…
267d · Tutorial · #inference #coding
268d ago
OpenBench: Reproducible LLM Evals Made Easy
Evaluating large language models (LLMs) today is fundamentally broken. If you've spent any time with eval frameworks, you already know the drill: each one makes different decisions on how to prompt models, parse responses, and measure metrics like accuracy. Every lab has its own approach – pass@k, best-of-n, zero-shot, few-shot, CoT prompting… the list goes on. It’s subtle, but it means you can never truly compare numbers across different frameworks or model releases. Even when everything is meticulously documented, reproducing results is frustratingly hard. Tiny implementation quirks creep in everywhere. In practice, benchmark scores are basically irreproducible. We ran into this problem at Groq enough times to finally decide: okay, let's fix this once and for all. We built OpenBench internally, and it genuinely helped us. So now we're releasing it publicly because reliable…
313d ago
Build Faster with Groq + Hugging Face
Simplicity of Hugging Face + Efficiency of Groq
Exciting news for developers and AI enthusiasts! Hugging Face is making it easier than ever to access Groq’s lightning-fast and efficient inference with the direct integration of Groq as a provider on the Hugging Face Playground and API. This means developers can now tap into Groq’s unparalleled efficiency when it comes to speed, cost, and production-level context windows, all with unified access and billing through the Hugging Face platform.
Easy Access from Hugging Face Playground
Simply select "Groq" as your provider, and your requests will be billed directly to your Hugging Face account at Groq's competitive pricing. For those who prefer to manage their Groq usage directly, you also have the option to add your own Groq API key.
Seamless Integration with the Hugging Face API…
319d ago
GroqCloud™ Now Supports Qwen3 32B
Delivering Fast Inference with the Full 131k Context Window
GroqCloud™ now supports Qwen3 32B, a cutting-edge, dense 32.8 billion parameter causal language model from Alibaba's Qwen3 series. This integration brings the power of Qwen3 32B's advanced multilingual capabilities to GroqCloud, enabling businesses to leverage complex reasoning and efficient dialogue across 100+ languages and dialects in their applications.
Groq Performance & Pricing
With Qwen3 32B on GroqCloud, developers can run advanced reasoning and multilingual workloads while keeping cost and latency low. Furthermore, Groq is the only fast inference provider to enable the full 131k context window for this model, allowing developers to build production-level workloads, not just POCs. Groq is offering Qwen3 32B at an on-demand price of $0.29 / M input tokens and $0.59 / M output tokens. Artificial Analysis has independently benchmarked Groq’s…
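At the quoted rates, a single full-context request is cheap enough to check on the back of an envelope. Taking "131k" as 131,072 tokens is an assumption about the post's rounding, and the output size below is an illustrative guess:

```python
# Cost of one maximal-context request at the quoted on-demand rates.
# 131,072 input tokens is an assumed reading of "131k"; the 2,000-token
# output is an illustrative guess, not a stated figure.
input_rate, output_rate = 0.29, 0.59      # $ per million tokens
input_tokens, output_tokens = 131_072, 2_000
cost = input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate
print(round(cost, 4))  # → 0.0392
```

Under these assumptions, even a request that fills the entire context window costs a few cents, which is the economics behind "production-level workloads, not just POCs."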
326d ago
LoRA Fine-Tune Support Now Live on GroqCloud
GroqCloud now supports Low-Rank Adaptation (LoRA) fine-tunes, exclusively by request, for our Enterprise tier customers. LoRA enables businesses to deploy adaptations of base models customized to their specific use cases on GroqCloud, offering a more efficient and cost-effective approach to model customization. As a part of this release, we are introducing the ability to serve multiple LoRAs at the same latency and speed as the base model on GroqCloud. This means customers can deploy LoRA fine-tuned models without requiring a full dedicated hardware instance.
Real-World Applications of LoRA Fine-Tuning
Phonely, a company focused on AI phone support agents, partnered with Maitai to enhance the speed and accuracy of their AI agents. By leveraging Maitai’s platform, which enables hotswapping of LoRA models that run on GroqCloud, they achieved significant improvements in real-time response performance…
333d ago
From Speed to Scale: How Groq Is Optimized for MoE & Other Large Models
You know Groq runs small models. But did you know we run large models, including MoE, uniquely well? Here’s why.
The Evolution of Advanced Openly-Available LLMs
There’s no argument that artificial intelligence (AI) has exploded, in part because of the advancements in large language models (LLMs). These models have shown some amazing capabilities when it comes to natural language processing, from text generation to complex reasoning. As LLMs become even more sophisticated, one of the biggest challenges is scaling them efficiently. That’s where Groq comes in, a company at the forefront of AI hardware innovation, addressing this challenge with its groundbreaking LPU. In the past few years, the AI community has seen a surge in open-source LLMs, including models like Llama, DeepSeek, and Qwen. These models…
333d · Hardware · #inference
344d ago
How to Build Your Own AI Research Agent with One Groq API Call
Simplifying the Complexity of AI Agents with Server-Side Tool Use
Large Language Models (LLMs) are powerful but constrained by static training data, lacking the ability to access real-time information or interact dynamically with external environments. We need real-time data, not snapshots from 2023. This limitation has driven the adoption of tool use (also known as function calling), which equips LLMs with tools or functions to fetch live data, execute code, and navigate complex tasks (including playing Pokémon). This evolution has led us to AI agents, which are systems that leverage LLMs and tools to autonomously interact with external environments. Neon recently reported that over 80% of their databases were created by AI agents, outpacing human contributions by 4x. There are even AI agent marketplaces now where we…
344d · Tutorial · #agents #inference
361d ago
Official Llama API Now Fastest via Groq Inference
The official Llama API is now accelerated by Groq. Served on the world’s most efficient inference chip, it’s the fastest way to run the world’s most trusted openly available models with no tradeoffs. In collaboration with Meta, a limited free preview is live now.
What Is It?
The official Llama API is an upcoming developer platform accelerated by Groq. It's not a wrapper. It’s not a copy. It’s the real thing, served directly from Meta and accelerated by Groq's purpose-built inference hardware. With Llama 4 and more available now, you can start building with zero setup.
Why It Matters to Builders
The Llama API, accelerated by Groq, offers several benefits to builders, including:
- First-party access: You're using Meta models, served the way they were meant to be. Optimized, up-to-date, and fully…
375d ago
Now in Preview: Groq’s First Compound AI System
Build with access to the internet and the ability to run code with a one-line change to your model string.
Compound Beta is Groq’s first compound AI system, released in preview on GroqCloud™. It combines openly available models already supported on our platform with built-in tool use, starting with web search and code execution, so developers can handle real-world queries in a single high-performance, low-cost API call.
What Can This Do That an LLM Can’t?
While LLMs are great at generating text, they are limited to what they were trained on. Compound Beta takes the next step. It is designed to solve problems by taking action, using tools like web search and code execution alongside powerful models. This allows the system to access real-time information, perform live computations, and interact with…
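The "one line change to your model string" can be sketched concretely: in an OpenAI-compatible chat request, only the model field changes to opt into the compound system. The model ids below are illustrative assumptions, and sending either request still requires a client pointed at GroqCloud:

```python
# The one-line swap the post describes: identical request shape, only the
# model string changes. Both model ids are illustrative assumptions.
def chat_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

plain = chat_request("llama-3.3-70b-versatile", "What changed in AI today?")
agentic = chat_request("compound-beta", "What changed in AI today?")  # one-line swap
```

Everything else (tool selection, web search, code execution) happens server-side, which is what keeps it a single API call.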
385d ago
Llama 4 Inference Fast & Affordable – Now Live on GroqCloud
Meta’s Llama 4 Scout and Maverick models are live today on GroqCloud™, giving developers and enterprises day-zero access to the most advanced open-source AI models available. Today, Meta released the first models in the Llama 4 herd, which will enable people to build more personalized multimodal experiences. With Llama 4 Scout and Llama 4 Maverick available on GroqCloud today to its free users and paid customers, developers can run cutting-edge multimodal workloads while keeping costs low and latency predictable.
https://youtu.be/Eq0rl6B1i5Y
Groq Performance & Pricing
Our vertically integrated GroqCloud and inference-first architecture deliver unmatched performance and price. Llama 4 Scout is currently running at over 460 tokens/s while Llama 4 Maverick is coming…
395d ago
Build Fast with Text-to-Speech AI – Dialog Model on Groq
Groq & PlayAI partner to bring Dialog, a leading TTS model, to GroqCloud™ for real-time voice applications
One of the most popular emerging applications for applied AI has been generative voice systems that converse with customers for services like customer support or appointment scheduling. In our world, the responsiveness and the emotional authenticity of the AI system are key to success. Delivering on these features requires fast AI inference and building conversational AI interfaces that people can actually use, and historically, there have been compromises between how well it works and how much it costs. That’s why Groq and PlayAI are partnering to leapfrog the world to the next generation of conversational human-AI interactions.
https://youtu.be/9rbz937y6DE
PlayAI is a leading provider of advanced Text-to-Speech (TTS) voice AI models based on LLMs,…
395d · Infra · #inference