Reiner Pope – The math behind how LLMs are trained and served

Dwarkesh Podcast · Apr 29, 2026

Reiner Pope explains the math behind LLM training & inference. Learn how batch size, memory, and compute dictate AI cost, latency, & progress.

LLM Inference Speed is Bottlenecked by Either Memory Bandwidth or Compute Throughput

A "roofline analysis" reveals that LLM performance is limited by the slower of two factors: the time it takes to fetch model parameters from memory (memory-bound) or the time it takes to perform matrix multiplications (compute-bound). Optimizing performance requires identifying and addressing the correct bottleneck.
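
As a rough illustration of the roofline idea above, the sketch below takes the step time to be the larger of the memory-fetch time and the compute time. All hardware and model numbers are hypothetical placeholders, and the "2 FLOPs per active parameter per token" rule is a common approximation assumed here, not a figure from the episode.

```python
def roofline_step_time(weight_bytes, flops, mem_bandwidth, peak_flops):
    """Return (step_time_s, regime) under a simple two-term roofline model."""
    t_memory = weight_bytes / mem_bandwidth   # time to stream parameters from memory
    t_compute = flops / peak_flops            # time to perform the matrix multiplications
    regime = "memory-bound" if t_memory >= t_compute else "compute-bound"
    return max(t_memory, t_compute), regime

# Hypothetical setup: 50B parameters at 2 bytes each, 2 TB/s memory, 1e15 FLOP/s peak.
WEIGHT_BYTES = 100e9
MEM_BW = 2e12
PEAK_FLOPS = 1e15

def decode_flops(batch_size, active_params=50e9):
    # Assumption: about 2 FLOPs per active parameter per generated token.
    return 2 * active_params * batch_size

for batch in (1, 1000):
    t, regime = roofline_step_time(WEIGHT_BYTES, decode_flops(batch), MEM_BW, PEAK_FLOPS)
    print(batch, t, regime)
```

With these placeholder numbers a batch of 1 is memory-bound and a batch of 1000 is compute-bound, which is the distinction the analysis is meant to surface.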

LLM "Fast Modes" Achieve Speed by Using Smaller, More Expensive Batches

API providers offer faster inference at a premium by reducing the number of users processed simultaneously (batch size). This lowers latency but makes each token more expensive because the fixed cost of loading model weights is spread across fewer requests, reducing amortization.
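
The trade-off can be sketched with a toy model in which each step pays a fixed weight-loading time plus a small per-request time. The function names, the timing constants, and the hardware price are all hypothetical; the point is only the shape of the curve.

```python
def step_latency_s(batch_size, weight_load_s=0.010, per_request_s=0.0001):
    """Time for one decode step: fixed weight load plus work that grows with batch."""
    return weight_load_s + batch_size * per_request_s

def cost_per_token(batch_size, hardware_cost_per_s=0.01, **timing):
    """Hardware cost of one step divided across the tokens it produces."""
    # One token per request per step, so a step yields batch_size tokens.
    return hardware_cost_per_s * step_latency_s(batch_size, **timing) / batch_size

for batch in (16, 256):
    print(batch, step_latency_s(batch), cost_per_token(batch))
```

Under these assumptions the smaller batch finishes each step sooner but pays more per token, because the fixed weight-loading time is divided among fewer requests.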

Larger GPU Scale-Up Domains Reduce Latency by Aggregating Memory Bandwidth

The key advantage of larger GPU clusters is their ability to use the memory bandwidth of all GPUs in parallel to load model weights. This massive aggregate bandwidth dramatically reduces memory fetch times, which is a primary latency bottleneck, especially for very large, sparse models.
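
A minimal sketch of the bandwidth argument, assuming the weights are sharded evenly across the domain and ignoring interconnect overhead. The parameter count, precision, and per-GPU bandwidth below are hypothetical.

```python
def weight_fetch_time_s(total_params, bytes_per_param, n_gpus, per_gpu_bandwidth):
    """Time to read every parameter once when all GPUs stream their shard in parallel."""
    total_bytes = total_params * bytes_per_param
    aggregate_bandwidth = n_gpus * per_gpu_bandwidth  # bytes/s across the whole domain
    return total_bytes / aggregate_bandwidth

# Hypothetical 1T-parameter model at 1 byte per parameter, 2 TB/s per GPU.
for n_gpus in (8, 64):
    print(n_gpus, weight_fetch_time_s(1e12, 1, n_gpus, 2e12))
```

In this idealized model the fetch time falls in proportion to the number of GPUs in the domain; the same quantity is also the per-step latency floor for a given domain size.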

GPU Rack Interconnect Size is Physically Limited by Cable Density and Cooling

Increasing the number of GPUs in a high-speed "scale-up" domain is a physical engineering challenge. It's constrained by the sheer density of cables that can fit within a rack's backplane, along with factors like cable bend radius, power delivery, cooling capacity, and structural weight.

Optimal LLM Batch Size is ~300 Times the Model's Sparsity Factor

The ideal batch size that balances memory-bound and compute-bound operations can be calculated by a simple formula. It's roughly 300 (a hardware constant for modern GPUs) multiplied by the model's sparsity (total parameters / active parameters), providing a practical starting point for performance optimization.
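
One way to see where such a formula can come from is to set the weight-fetch time equal to the matmul time and solve for the batch size. The sketch below does that under stated assumptions: 2 FLOPs per active parameter per token, 2 bytes per parameter, and hypothetical hardware figures chosen so that the hardware ratio comes out to 300.

```python
def balanced_batch_size(total_params, active_params, peak_flops, mem_bandwidth,
                        bytes_per_param=2.0, flops_per_param=2.0):
    """Batch size at which memory time equals compute time.

    memory time  = total_params * bytes_per_param / mem_bandwidth
    compute time = batch * active_params * flops_per_param / peak_flops
    Setting the two equal and solving for batch gives the expression below.
    """
    hardware_ratio = (peak_flops / mem_bandwidth) * (bytes_per_param / flops_per_param)
    sparsity = total_params / active_params
    return hardware_ratio * sparsity

# Hypothetical chip: 6e14 FLOP/s and 2e12 bytes/s, i.e. a hardware ratio of 300.
# Hypothetical MoE model: 400B total parameters, 50B active (sparsity 8).
print(balanced_batch_size(400e9, 50e9, peak_flops=6e14, mem_bandwidth=2e12))
```

Below this batch size the step is memory-bound under the model; above it, compute-bound.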

Pipeline Parallelism Fits Large Models in Memory But Offers No Inference Latency Benefit

Spreading a model's layers across multiple GPU racks (pipeline parallelism) is a strategy to overcome memory capacity limits on a single rack. However, for inference, it offers no latency improvement; the total time remains the same. Its sole benefit is in memory capacity management for enormous models.

Minimum LLM Latency is Dictated by the Time to Read All Model Parameters from Memory

For any given hardware, there is a fundamental lower bound on inference latency. This "latency floor" is the time required to load the model's total parameters from memory (e.g., HBM) onto the chip. This process cannot be sped up by reducing batch size or other software tricks.

LLM Inference GPUs Operate on a Fixed "Train Schedule" (e.g., Every 20ms)

High-throughput GPU clusters process batches on a fixed interval (e.g., every 20ms), like a train schedule. This interval is determined by the time it takes to "drain" the GPU's HBM. Requests are queued for the next departure, and the train leaves even if not full, which determines queueing latency.
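
The queueing effect of a fixed departure schedule can be sketched as follows. The arrival times are made up, and the model assumes a request that arrives exactly at a departure time boards that batch.

```python
import math

def queueing_delays(arrival_times_ms, interval_ms=20.0):
    """Wait of each request until the next scheduled batch departure."""
    delays = []
    for t in arrival_times_ms:
        # Departures happen at 0, interval, 2*interval, ...; take the first one at or after t.
        next_departure = math.ceil(t / interval_ms) * interval_ms
        delays.append(next_departure - t)
    return delays

print(queueing_delays([1.0, 19.0, 20.0, 25.0]))  # → [19.0, 1.0, 0.0, 15.0]
```

No request waits longer than one interval, and for arrivals spread uniformly over time the average wait is about half the interval.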

Cryptographic "Feistel Ciphers" Help Train Neural Networks by Reducing Memory Usage

The Feistel network, a construction from cryptography, builds an invertible transformation out of arbitrary, non-invertible functions. Applied to neural network layers ("RevNets"), it lets activations from the forward pass be recomputed during the backward pass instead of stored. This trades extra compute for a large reduction in memory footprint during training.
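
A minimal sketch of the additive coupling used in reversible layers: the input is split into two halves, and each half is updated using a function of the other. The functions `f` and `g` below are arbitrary stand-ins (neither is invertible on its own), and plain Python lists stand in for activation tensors.

```python
import math

def f(v):
    # Stand-in for a sub-layer; any function works, invertible or not.
    return [math.tanh(3.0 * a) for a in v]

def g(v):
    # Squaring is not invertible, which is the point of the demonstration.
    return [a * a for a in v]

def forward(x1, x2):
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order, recomputing g and f on known values.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.5, -1.0], [2.0, 0.25]
r1, r2 = inverse(*forward(x1, x2))
print(r1, r2)  # matches x1, x2 up to floating-point rounding
```

Because the inputs can be recovered from the outputs, a training loop built this way only needs to keep the final activations and can rebuild earlier ones during the backward pass, at the cost of re-evaluating `f` and `g`.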

Cheaper Prefill Pricing Reveals LLMs are Memory-Bound During Single-Token Decode

APIs charge less for input prompts (prefill) than for generating responses (decode). This is because prefill processes many tokens at once, becoming compute-bound. Decode generates tokens one-by-one, making each step dominated by the high, unamortized cost of memory access. The price difference reflects this efficiency gap.

A Single GPU Rack's Interconnect Defines the Practical Size Limit for an MoE Layer

Mixture-of-Experts (MoE) models require an "all-to-all" communication pattern. This is efficient within a single GPU rack's high-speed interconnect but becomes a major bottleneck between racks, where communication is ~8x slower. This effectively limits an MoE layer's maximum size to what a single rack can support.

Production LLMs Are "Over-trained" by 100x vs. Chinchilla Laws to Optimize for Inference Cost

The Chinchilla scaling law optimizes pre-training compute alone. However, production models must also account for inference costs. By training smaller models on much more data (~100x the Chinchilla optimum), labs create models that are cheaper to run for users, effectively amortizing the higher training cost over the model's lifetime.
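
To make the "~100x" figure concrete, the sketch below expresses over-training as a multiple of a baseline tokens-per-parameter ratio. The baseline of 20 tokens per parameter is the commonly quoted Chinchilla rule of thumb and is an assumption here, as are the model size and token count.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # assumed rule-of-thumb baseline

def overtraining_factor(params, train_tokens):
    """How many times more tokens per parameter than the assumed baseline."""
    return (train_tokens / params) / CHINCHILLA_TOKENS_PER_PARAM

# Hypothetical: an 8B-parameter model trained on 16T tokens.
print(overtraining_factor(8e9, 16e12))
```

With these placeholder values the model sees 2,000 tokens per parameter, which is 100 times the assumed baseline.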

LLM Price Hikes for Long Contexts Signal a Shift from Compute to Memory Bottlenecks

At shorter context lengths, LLM cost is dominated by compute. As context grows, fetching the KV cache from memory becomes the bottleneck. A pricing tier that increases cost above a certain context length (e.g., 200k tokens) indicates the approximate point where the system becomes memory-bandwidth limited and thus less efficient.
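
A rough way to locate such a crossover is to compare, per sequence and per decode step, the compute time for one token with the time to read that sequence's KV cache. The sketch assumes a standard attention layout (keys and values stored per layer and per KV head), 2 FLOPs per active parameter per token, and hypothetical model and hardware numbers; it is not meant to reproduce any provider's actual threshold.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes stored per token of context."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def crossover_context_tokens(active_params, peak_flops, mem_bandwidth, kv_bytes):
    """Context length where reading one sequence's KV cache takes as long as its compute."""
    compute_time_per_token = 2 * active_params / peak_flops
    return compute_time_per_token * mem_bandwidth / kv_bytes

# Hypothetical model: 60 layers, 8 KV heads of dimension 128, 50B active parameters.
kv = kv_bytes_per_token(n_layers=60, n_kv_heads=8, head_dim=128)
print(kv, crossover_context_tokens(50e9, peak_flops=1e15, mem_bandwidth=2e12, kv_bytes=kv))
```

Beyond the returned context length, KV-cache reads dominate the step in this simplified model, so each additional token of context makes decoding proportionally slower.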

A Production LLM's Compute Budget is Optimally Split 1/3 Pre-training, 1/3 RL, 1/3 Inference

To minimize the total cost for a certain level of performance, the compute budgets for a model's lifecycle stages should be balanced. A powerful heuristic is to equalize the costs: the compute spent on pre-training should roughly equal the compute for RL/fine-tuning, and also equal the total compute for user inference.
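
As a worked example of the balance between two of the three stages, the sketch below uses the common approximations of 6 FLOPs per parameter per training token and 2 FLOPs per parameter per served token. Both approximations are assumptions, they ignore attention cost, and the RL share is not modeled; for a sparse model, `params` would be the active parameter count.

```python
def pretraining_flops(params, train_tokens):
    # Assumed approximation: forward plus backward pass costs ~6 FLOPs per parameter per token.
    return 6.0 * params * train_tokens

def inference_flops(params, served_tokens):
    # Assumed approximation: a forward pass costs ~2 FLOPs per parameter per token.
    return 2.0 * params * served_tokens

def served_tokens_at_balance(train_tokens):
    """Served tokens at which inference compute equals pre-training compute."""
    # 6 * N * D = 2 * N * T  implies  T = 3 * D, independent of model size.
    return 3.0 * train_tokens

D = 16e12  # hypothetical training-token count
T = served_tokens_at_balance(D)
print(T, pretraining_flops(8e9, D), inference_flops(8e9, T))
```

Under these approximations the two budgets match once the model has served about three times as many tokens as it was trained on.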

API Cache Pricing Tiers Expose the Underlying Hardware Memory Hierarchy (HBM vs. DDR vs. Flash)

When an API offers different pricing for caching context for various durations (e.g., 5 minutes vs. 1 hour), it is likely offering storage in different physical memory tiers. The shortest, most expensive tier is likely fast HBM, while longer, cheaper tiers could be DDR memory, flash storage, or even spinning disk.
