MoE Math Demystified: What Does 8x7B Actually Mean?
October 14, 2025
This video breaks down MoE inference arithmetic and deployment bottlenecks across different hardware setups. If you can't open the video displayed above, please use this link to open it on YouTube: https://youtu.be/gHpDBoyCOrE

What does 8x7B actually mean? You probably thought it meant 8 experts with 7B active parameters per token. We did too. Turns out it's actually 13B active parameters. But wait, where does 13B come from? This is exactly the kind of confusion this post clears up (skip to the answer). We'll explain what those numbers actually mean for inference: how much memory you need, how many GPUs, and which bottlenecks you'll commonly hit in production deployment. We'll show that single-GPU deployment is memory-bound, multi-GPU setups are communication-bound, and specialized hardware like the Cerebras WSE is compute-bound.

Originally, we set out to write a simple post on MoE arithmetic. Then we kept digging. And digging. What started as basic math turned into a full explanation of inference bottlenecks, hardware architectures, and deployment strategies. Welcome to MoE inference 101. The title stayed, but the scope didn't.

Up to this point, the Mixture-of-Experts (MoE) series has focused on training. In part 3, you trained your own MoE model, and in part 4 you scaled it to production size. Now what? We shift to inference. During inference, model weights are frozen: no gradients, no optimizer states. That sounds much simpler than training. But! Despite doing less work, MoE inference has challenges of its own. Fun fact: people who train models rarely think about inference costs, and vice versa. Your authors are no exception, but we are trying to do better. So, if you're deploying MoE models, or you trained an MoE model and want to know what your design choices have led to, or you simply want to understand how to run MoE inference efficiently, keep scrolling.

Table 1: Notation.

How much memory do I need?

Want to avoid hitting an OOM error during inference deployment? Let's examine the two components that dominate memory: the model weights and the KV cache. To simplify the calculations, we use a standard modern transformer setup: RoPE positional embeddings (Su et al., 2023), SwiGLU nonlinearity (Shazeer, 2020; Dauphin et al., 2016), layer norms, multi-head attention, untied embeddings, and industry-standard learned routing (Soboleva, 2025a).

Model weights

Let's walk through the math. First, we'll calculate the memory requirement of a single decoder block, then account for all layers, and finally add the remaining network parameters that live outside the decoder blocks (bottom to top). Bias terms are omitted from the following equations because they are negligible regardless of model size.

Our MoE model consists of an embedding layer, followed by decoder blocks, and an unembedding layer. Each decoder block contains two layer norms, an attention layer, a router, and an MoE layer with expert networks (Figure 1).

Figure 1: Visual breakdown of the MoE decoder architecture. The MoE model consists of an embedding layer, followed by decoder blocks, and an unembedding layer.…
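To make the 8x7B-vs-13B arithmetic concrete before we derive it symbolically, here is a back-of-the-envelope sketch in Python. It plugs in the publicly documented Mixtral 8x7B configuration (hidden size 4096, 32 layers, grouped-query attention with 8 KV heads, SwiGLU inner dimension 14336, 8 experts with top-2 routing, 32K vocabulary). Note that Mixtral uses grouped-query attention rather than the plain multi-head attention of our standard setup above, so treat this as an illustrative sanity check rather than this post's derivation.

```python
# Back-of-the-envelope parameter count for a Mixtral-8x7B-style MoE.
# Config values are from the public Mixtral release; the script is a
# sanity check, not an official accounting.

d_model   = 4096    # hidden size
n_layers  = 32      # decoder blocks
n_heads   = 32      # attention (query) heads
n_kv      = 8       # KV heads (grouped-query attention)
d_head    = d_model // n_heads
d_ff      = 14336   # SwiGLU inner dimension
n_experts = 8       # experts per MoE layer
top_k     = 2       # experts activated per token
vocab     = 32000

# Attention: Q and O projections are d_model x d_model;
# K and V shrink with GQA to n_kv * d_head columns each.
attn = d_model * (2 * d_model + 2 * n_kv * d_head)

# One SwiGLU expert has three weight matrices: gate, up, down.
expert = 3 * d_model * d_ff
router = d_model * n_experts

per_layer_total  = attn + router + n_experts * expert
per_layer_active = attn + router + top_k * expert

# Untied embedding and unembedding matrices.
embeds = 2 * vocab * d_model

total  = n_layers * per_layer_total  + embeds
active = n_layers * per_layer_active + embeds

print(f"total params:  {total / 1e9:.1f}B")   # ~46.7B
print(f"active params: {active / 1e9:.1f}B")  # ~12.9B -- the "13B"
```

Every token runs through all of the attention weights but only top_k of the 8 experts, which is why the active count lands near 13B rather than 7B or 56B.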

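Once the parameter counts are in hand, the two memory terms from the section above follow directly. The sketch below reuses the Mixtral-style numbers from the previous snippet; the fp16 weights, 32K context, and batch size of 8 are illustrative assumptions, not recommendations.

```python
# Rough serving-memory estimate for the two dominant components:
# model weights and the KV cache. Config values mirror the
# parameter-count sketch above (Mixtral-8x7B-like).

n_layers, n_kv, d_head = 32, 8, 128
total_params = 46.7e9        # from the sketch above
bytes_per_el = 2             # fp16/bf16

weights_gb = total_params * bytes_per_el / 1e9

# The KV cache stores one K and one V vector per layer per token,
# sized by the (GQA-reduced) KV heads. All experts share it, since
# the cache belongs to attention, not to the MoE layer.
kv_bytes_per_token = 2 * n_layers * n_kv * d_head * bytes_per_el

seq_len, batch_size = 32_768, 8
kv_cache_gb = kv_bytes_per_token * seq_len * batch_size / 1e9

print(f"weights:  {weights_gb:.0f} GB")   # ~93 GB
print(f"kv cache: {kv_cache_gb:.0f} GB")  # ~34 GB
```

Notice that even in fp16 the full weights alone exceed a single 80 GB accelerator, even though only ~13B parameters are active per token: this is the single-GPU memory bound mentioned in the introduction.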
