
vLLM Korea Meetup 2026 Wrap-Up (Apr 14, 2026 · 7 min read)

Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd.

The meetup proved to be much more than a standard tech event. Not only did it draw a strong turnout on the day, but the post-event survey recorded an impressive ~75% response rate, a testament to the attendees' active engagement. The results reflected high overall satisfaction, confirming that the meetup delivered both in-depth practical content and a genuine community experience.

Field engineers from a wide range of companies and research institutions gathered to share real-world deployment stories and infrastructure strategies for running LLMs in production. As AI moves beyond the research phase and into full-scale services, handling inference workloads efficiently has become a central challenge.…

vLLM Blog
Inside vLLM: Anatomy of a High-Throughput LLM Inference System (Sep 5, 2025 · 41 min read)
Note: Originally posted on Aleksa Gordic's website.

From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale.

In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular, I'll be doing a breakdown of how vLLM [1] works. This post is the first in a series. It starts broad and then layers in detail (following an inverse-pyramid approach) so you can form an accurate high-level mental model of the complete system without drowning in minutiae. Later posts will dive into specific subsystems.

This post is structured into five parts:

- LLM engine & engine core: fundamentals of vLLM (scheduling, paged attention, continuous batching, etc.)
- Advanced features: chunked prefill, prefix…
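The paged-attention idea the post names, fixed-size KV-cache blocks handed out from a shared pool and mapped to each request through a per-request block table, can be sketched in a few lines of plain Python. This is a toy illustration under my own assumptions, not vLLM's actual allocator; the class names, block size, and methods here are hypothetical.

```python
# Toy sketch of paged KV-cache block allocation (illustrative only,
# not vLLM's real block manager).

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared free pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            # In a real engine the scheduler would preempt or swap here.
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's block table: token positions -> physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full,
        # so memory grows in block_size steps instead of reserving one
        # contiguous max-length buffer per request up front.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Finished sequences return their blocks to the shared pool,
        # which is what lets continuous batching admit new requests.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


alloc = BlockAllocator(num_blocks=8, block_size=4)
seq = Sequence(alloc)
for _ in range(9):                 # 9 tokens -> ceil(9 / 4) = 3 blocks
    seq.append_token()
used = len(seq.block_table)        # → 3
seq.release()
remaining = len(alloc.free_blocks) # → 8, all blocks back in the pool
```

The design point this sketch tries to capture is the one the post builds on: because blocks are small and non-contiguous, fragmentation stays bounded and freed blocks from completed requests are immediately reusable by new ones.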