# vLLM Tops the Artificial Analysis Leaderboard

May 11, 2026 · 15 min read

How vLLM built the leading deployments of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B.
Last week, DigitalOcean published inference benchmarks across three frontier open-weight models. On DeepSeek V3.2, the deployment achieved a best per-user output throughput of 230 TPS, more than 4x what the majority of inference providers report for the same model. On the Qwen 3.5 397B release, it ranked first across all 12 providers measured by Artificial Analysis, with TTFT under 1 second on 10,000-token prompts.

The notable part: the engine underneath is open source. It's vLLM.

A common assumption in production AI is that the best inference performance requires a proprietary stack. In this case, however, a community-built inference engine running on the same NVIDIA Blackwell Ultra silicon ranked first. The optimizations behind these results are not locked in a private fork: op fusions for DeepSeek V3.2, a custom EAGLE3 draft model for MiniMax-M2.5, and a set of fusions tuned to Qwen 3.5's linear-attention path. Every change is either in vLLM main or in flight to land there. This post is about how this deployment was built.

## How vLLM made it fast

The work split across three models, each with its own bottleneck and its own fix:

- DeepSeek V3.2: aggressive kernel fusion to cut overhead at low batch sizes (also applicable to DeepSeek V4).
- MiniMax-M2.5: targeted kernel fusion paired with a custom EAGLE3 draft model, trained on open-source TorchSpec and vLLM even though the draft model itself is custom. The same draft works on M2.7; the architectures are identical.
- Qwen 3.5 397B: targeted fusions for the model's attention and normalization path.

The following sections walk through each model in turn.

## DeepSeek V3.2: Kernel Fusion at Low Batch Sizes

At low batch sizes, DeepSeek V3.2 was bound by GPU kernel launch overhead, not compute. Each transformer layer was issuing dozens of separate kernels: small operations like normalization, rotary embedding, and quantization that the GPU itself executed in microseconds, each carrying a fixed launch cost that came to dominate total time.

The fix was op fusion across the attention path. Operations that previously launched as separate kernels (Q and KV normalization, rotary embedding for Q and KV, the indexer's layer norm and rotary embedding, FP8 quantization, and KV cache writes) collapsed into a pair of fused kernels covering everything outside attention and MoE. Per-layer kernel count dropped from ~33 toward a target of ~10.

The fusion alone delivered a 1.28× speedup at batch size 1 (85.8 → 109.3 tok/s on 4× GB200, no MTP). On a single 8× B300 node at concurrency 1:

- Without MTP (TP=8): 125 tok/s
- With MTP=1 (TP=8): 234 tok/s (~90% draft acceptance rate)
- With prefill/decode disaggregation (TP=4 + TP=4 + MTP=3): 262 tok/s

Beyond fusion, two DSv3.2-specific kernels closed remaining gaps. A new router GEMM kernel, specialized for DSv3's MoE routing dimensions at small decode batch sizes, replaced the generic matmul and delivered an additional 6% speedup at batch 1…
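To make the launch-overhead effect concrete, here is a minimal PyTorch sketch, not vLLM's actual fused kernels: it times a decode-style chain of small ops at batch size 1, first eagerly (one kernel launch per op) and then through `torch.compile` as a stand-in for hand-written fusion. The shapes and the fake FP8 quantization step are illustrative.

```python
# Minimal sketch, not vLLM's fused kernels: a decode-style chain of tiny ops
# (norm, rotary-style rotation, FP8-style quantization) at batch size 1.
# Eager mode pays one kernel launch per op; torch.compile stands in for fusion.
import torch

HIDDEN = 7168  # DeepSeek V3-sized hidden dim, used here only for realistic shapes

def decode_step(x, w_norm, cos, sin):
    x = torch.nn.functional.rms_norm(x, (HIDDEN,), w_norm)
    x1, x2 = x.chunk(2, dim=-1)
    x = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    scale = x.abs().amax().float() / 448.0          # dynamic FP8 E4M3-style scale
    return (x.float() / scale).to(torch.float8_e4m3fn), scale

def bench(fn, *args, iters=200):
    for _ in range(10):                             # warmup (and compile, if any)
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return end.elapsed_time(start) / iters          # ms per decode step

x = torch.randn(1, HIDDEN, device="cuda", dtype=torch.bfloat16)   # batch size 1
w = torch.ones(HIDDEN, device="cuda", dtype=torch.bfloat16)
cos = torch.randn(1, HIDDEN // 2, device="cuda", dtype=torch.bfloat16)
sin = torch.randn(1, HIDDEN // 2, device="cuda", dtype=torch.bfloat16)

print("eager:", bench(decode_step, x, w, cos, sin), "ms")          # many small launches
print("fused:", bench(torch.compile(decode_step), x, w, cos, sin), "ms")
```

At batch size 1 the eager timing is dominated by launch overhead rather than arithmetic, which is the same effect the fused kernels remove from DeepSeek V3.2's pre-attention path.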
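For the MTP numbers quoted above, the relevant knob on the serving side is vLLM's speculative decoding configuration. The following is a minimal sketch of the offline API; the model ID, parallel size, and the exact `speculative_config` keys are assumptions and should be checked against the vLLM version you run.

```python
# Minimal sketch of enabling speculative decoding (MTP-style draft) in vLLM.
# The model ID and speculative_config keys are illustrative; consult the vLLM
# docs for the exact options supported by your build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2",        # illustrative checkpoint name
    tensor_parallel_size=8,                    # TP=8 on a single 8-GPU node
    speculative_config={
        "method": "deepseek_mtp",              # MTP head shipped with the checkpoint
        "num_speculative_tokens": 1,           # MTP=1, as in the numbers above
    },
)

outputs = llm.generate(
    ["Explain kernel fusion in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```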

