$ timeahead_
★ TOP STORY · Open Source · 7d ago

My Workflow for Understanding LLM Architectures

A learning-oriented workflow for understanding new open-weight model releases

Many people have asked me over the past months to share my workflow for how I come up with the LLM architecture sketches and drawings in my articles, talks, and the LLM-Gallery. So I thought it would be useful to document the process I usually follow. The short version is that I usually start with the official technical reports, but these days papers are often less detailed than they used to be, especially for most open-weight models from industry labs. The good news is that if the weights are shared on the Hugging Face Model Hub and the model is supported in the Python transformers library, we can usually inspect the config file and the reference implementation directly to get more information about the architecture details.…

Ahead of AI (Sebastian Raschka)
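The config-inspection step the article describes can be sketched as follows. The field names below follow the typical Hugging Face `config.json` layout, but the numeric values are made-up placeholders for illustration, not any particular model's; with `transformers` installed, you could load a real config via `AutoConfig.from_pretrained("<model-id>")` instead.

```python
import json

# Hypothetical config.json snippet. Field names follow the usual
# Hugging Face transformers layout; the values are illustrative only.
config_text = """
{
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "intermediate_size": 14336,
  "vocab_size": 128000,
  "rope_theta": 500000.0
}
"""

cfg = json.loads(config_text)

# Derived architecture facts one typically notes for a sketch:
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]            # per-head dimension
gqa_group = cfg["num_attention_heads"] // cfg["num_key_value_heads"]   # query heads per KV head
uses_gqa = cfg["num_key_value_heads"] < cfg["num_attention_heads"]     # GQA if fewer KV heads

print(f"head_dim={head_dim}, query-heads-per-KV-head={gqa_group}, GQA={uses_gqa}")
```

A handful of such fields (hidden size, head counts, intermediate size, RoPE base) is usually enough to reconstruct an architecture diagram before reading the reference implementation.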
▲ trending · last 48h
Ahead of AI (Sebastian Raschka) · 12 articles
21d ago
Components of A Coding Agent
How coding agents use tools, memory, and repo context to make LLMs work better in practice

In this article, I want to cover the overall design of coding agents and agent harnesses: what they are, how they work, and how the different pieces fit together in practice. Readers of my Build a Large Language Model (From Scratch) and Build a Large Reasoning Model (From Scratch) books often ask about agents, so I thought it would be useful to write a reference I can point to. More generally, agents have become an important topic because much of the recent progress in practical LLM systems is not just about better models, but about how we use them. In many real-world applications, the surrounding system, such as tool use, context management, and memory, plays as much of a…
Agents · #agents #coding · by Sebastian Raschka, PhD
34d ago
A Visual Guide to Attention Variants in Modern LLMs
From MHA and GQA to MLA, sparse attention, and hybrid architectures

I had originally planned to write about DeepSeek V4. Since it still hasn't been released, I used the time to work on something that had been on my list for a while: collecting, organizing, and refining the different LLM architectures I have covered over the past few years. Over the last two weeks, I turned that effort into an LLM architecture gallery (with 45 entries at the time of this writing), which combines material from earlier articles with several important architectures I had not documented yet. Each entry comes with a visual model card, and I plan to keep the gallery updated regularly. You can find the gallery here: https://sebastianraschka.com/llm-architecture-gallery/ After I shared the initial version, a few…
Tutorial · by Sebastian Raschka, PhD
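As a rough illustration of why attention variants like GQA (and, more aggressively, MLA) matter in practice, here is a back-of-the-envelope KV-cache size calculation. The layer and head counts are made-up placeholders, not taken from any model covered in the article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the K and V caches (hence the leading factor 2) for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model: 32 layers, head_dim 128, 32k context, fp16 cache.
# MHA keeps 32 KV heads; GQA shares each KV head across 4 query heads (8 KV heads).
mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=32_768)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=32_768)

print(f"MHA cache: {mha / 1e9:.1f} GB, GQA cache: {gqa / 1e9:.1f} GB")
```

Under these assumed numbers, GQA shrinks the cache by exactly the query-to-KV head ratio (4x here), which is the main reason nearly all recent open-weight models use it.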
59d ago
A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026
A Round-Up and Comparison of 10 Open-Weight LLM Releases in Spring 2026

If you have struggled a bit to keep up with open-weight model releases, this article should catch you up on the main themes. I will walk you through the ten main releases in chronological order, with a focus on their architectural similarities and differences:

1. Arcee AI's Trinity Large (Jan 27, 2026)
2. Moonshot AI's Kimi K2.5 (Jan 27, 2026)
3. StepFun Step 3.5 Flash (Feb 1, 2026)
4. Qwen3-Coder-Next (Feb 3, 2026)
5. z.AI's GLM-5 (Feb 12, 2026)
6. MiniMax M2.5 (Feb 12, 2026)
7. Nanbeige 4.1 3B (Feb 13, 2026)
8. Qwen 3.5 (Feb 15, 2026)
9. Ant Group's Ling 2.5 1T & Ring 2.5 1T (Feb 16, 2026)
10. Cohere's Tiny Aya (Feb 17, 2026)

Update 1:…
Open Source · by Sebastian Raschka, PhD
91d ago
Categories of Inference-Time Scaling for Improved LLM Reasoning
And an Overview of Recent Inference-Scaling Papers (Including Recursive Language Models)

Inference scaling has become one of the most effective ways to improve answer quality and accuracy in deployed LLMs. The idea is straightforward: if we are willing to spend a bit more compute and time at inference (when we use the model to generate text), we can get the model to produce better answers. Every major LLM provider relies on some flavor of inference-time scaling today, and the academic literature around these methods has grown a lot, too. Back in March, I wrote an overview of the inference-scaling landscape and summarized some of the early techniques. In this article, I want to take that earlier discussion a step further, group the different approaches into clearer categories, and highlight…
Research · #inference · by Sebastian Raschka, PhD
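One of the simplest forms of parallel inference-time scaling, majority voting over multiple sampled answers (often called self-consistency), can be sketched in a few lines. The sampled answers below are hypothetical stand-ins for repeated temperature > 0 generations of the same prompt; the article's actual categorization may group methods differently.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among sampled completions.

    More samples cost more inference compute but make the majority
    answer more reliable, which is the basic scaling trade-off."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers extracted from 5 sampled generations:
samples = ["42", "41", "42", "42", "17"]
print(majority_vote(samples))  # -> 42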
116d ago
The State Of LLMs 2025: Progress, Problems, and Predictions
As 2025 comes to a close, I want to look back at some of the year's most important developments in large language models, reflect on the limitations and open problems that remain, and share a few thoughts on what might come next. As I tend to say every year, 2025 was a very eventful year for LLMs and AI, and this year there was no sign of progress saturating or slowing down.

1. The Year of Reasoning, RLVR, and GRPO

There are many interesting topics I want to cover, but let's start chronologically in January 2025. Scaling still worked, but it didn't really change how LLMs behaved or felt in practice (the only exception was OpenAI's freshly released o1, which added reasoning traces). So, when DeepSeek released their R1…
Research · #inference #benchmark · by Sebastian Raschka, PhD
116d ago
LLM Research Papers: The 2025 List (July to December)
In June, I shared a bonus article with my curated and bookmarked research-paper lists with the paid subscribers who make this Substack possible. In a similar vein, as a thank-you to all the kind supporters, I have prepared a list below of the interesting research articles I bookmarked and categorized from July to December 2025. I skimmed the abstracts of these papers but read only a very small fraction in full. However, I still like to keep collecting these organized lists, as I often go back to them when working on a given project. By the way, I was also working on my annual LLM review article, State of LLMs 2025: Progress, Problems, and Predictions, which I published today as well. You can find it here: Originally, I planned to include…
Research · by Sebastian Raschka, PhD
143d ago
From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates
Understanding How DeepSeek's Flagship Open-Weight Models Evolved

Last updated: January 1st, 2026

Similar to DeepSeek V3, the team released their new flagship model over a major US holiday weekend. Given DeepSeek V3.2's really good performance (at the GPT-5 and Gemini 3.0 Pro level), and the fact that it's also available as an open-weight model, it's definitely worth a closer look. I covered the predecessor, DeepSeek V3, at the very beginning of my The Big LLM Architecture Comparison article, which I kept extending over the months as new architectures got released. Originally, as I had just gotten back from the Thanksgiving holidays with my family, I planned to "just" extend that article with another section on this new DeepSeek V3.2 release, but I then realized that there's just too much interesting information…
Open Source · by Sebastian Raschka, PhD
172d ago
Beyond Standard LLMs
Linear Attention Hybrids, Text Diffusion, Code World Models, and Small Recursive Transformers

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, while others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions about what I think of alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I also promised…
Open Source · #coding · by Sebastian Raschka, PhD
202d ago
Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)
Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples

How do we actually evaluate LLMs? It's a simple question, but one that tends to open up a much bigger discussion. When advising or collaborating on projects, one of the things I get asked most often is how to choose between different models and how to make sense of the evaluation results out there. (And, of course, how to measure progress when fine-tuning or developing our own.) Since this comes up so often, I thought it might be helpful to share a short overview of the main evaluation methods people use to compare LLMs. Of course, LLM evaluation is a very big topic that can't be exhaustively covered in a single resource, but I think that having a clear mental…
Research · #coding #benchmark · by Sebastian Raschka, PhD
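The first of the four approaches named in the subtitle, multiple-choice benchmarking, ultimately reduces to comparing the model's chosen answer letter against a gold label. A minimal scorer might look like the sketch below; the example items are invented for illustration, not drawn from any real benchmark, and real harnesses usually pick the letter via answer-option log-likelihoods rather than free-form generation.

```python
def mc_accuracy(predictions, gold):
    """Fraction of multiple-choice items where the predicted letter matches gold."""
    assert len(predictions) == len(gold), "one prediction per item"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical model picks vs. gold answers for 4 items:
preds = ["B", "C", "A", "D"]
gold  = ["B", "C", "B", "D"]
print(mc_accuracy(preds, gold))  # -> 0.75
```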
231d ago
Understanding and Implementing Qwen3 From Scratch
A Detailed Look at One of the Leading Open-Source LLMs

Previously, I compared the most notable open-weight architectures of 2025 in The Big LLM Architecture Comparison. Then I zoomed in and discussed the various architecture components in From GPT-2 to gpt-oss: Analyzing the Architectural Advances on a conceptual level. Since all good things come in threes, before covering some of the noteworthy research highlights of this summer, I wanted to dive into these architectures hands-on, in code. By following along, you will understand how Qwen3 actually works under the hood and gain building blocks you can adapt for your own experiments or projects. I picked Qwen3 (initially released in May and updated in July) because it is one of the most widely liked and used open-weight model families as of this…
Open Source · #qwen #open-source · by Sebastian Raschka, PhD
259d ago
From GPT-2 to gpt-oss: Analyzing the Architectural Advances
And How They Stack Up Against Qwen3

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later). Earlier GPT models showed how the transformer architecture scales, and the 2022 ChatGPT release then made these models mainstream by demonstrating concrete usefulness for writing and knowledge (and later coding) tasks. Now they have shared their long-awaited open-weight models, and the architecture has some interesting details. I spent the past few days reading through the code and technical reports to summarize the most interesting details. (Just days after, OpenAI also announced GPT-5, which I…
Model · #qwen · by Sebastian Raschka, PhD
280d ago
The Big LLM Architecture Comparison
From DeepSeek V3 to GLM-5: A Look at Modern LLM Architecture Design

Last updated: Apr 2, 2026 (added Gemma 4 in section 23)

It has been seven years since the original GPT architecture was developed. At first glance, looking back at GPT-2 (2019) and forward to DeepSeek V3 and Llama 4 (2024-2025), one might be surprised at how structurally similar these models still are. Sure, positional embeddings have evolved from absolute to rotational (RoPE), Multi-Head Attention has largely given way to Grouped-Query Attention, and the more efficient SwiGLU has replaced activation functions like GELU. But beneath these minor refinements, have we truly seen groundbreaking changes, or are we simply polishing the same architectural foundations? Comparing LLMs to determine the key ingredients that contribute to their good (or not-so-good) performance is notoriously challenging: datasets, training techniques,…
Model · by Sebastian Raschka, PhD
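To make one of the refinements mentioned in the teaser concrete, here is a minimal SwiGLU sketch in plain Python. Real feed-forward layers apply this gate with weight matrices (gate, up, and down projections); scalar weights are used here purely for illustration, and the weight values are arbitrary.

```python
import math

def silu(x):
    """SiLU/Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    """Elementwise SwiGLU gate for a single hidden unit:
    silu(gate_proj(x)) * up_proj(x).

    In a real FFN block, w_gate and w_up are weight matrices and the
    result is passed through a down projection; scalars suffice to
    show the gating structure."""
    return silu(w_gate * x) * (w_up * x)

print(round(swiglu(1.0, w_gate=2.0, w_up=0.5), 4))  # -> 0.8808
```

The multiplicative gate is what distinguishes SwiGLU from a plain GELU or ReLU feed-forward layer: one projection modulates the other instead of a single nonlinearity acting alone.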