High-throughput GPU clusters process batches on a fixed interval (e.g., every 20 ms), like a train schedule. The interval is set by the time it takes to "drain" the model's weights from the GPU's HBM for one forward pass. Incoming requests queue for the next departure, and the train leaves even if it isn't full; that timetable, not the individual request, sets the queueing latency.
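A minimal sketch of the train-schedule idea (the interval, queue, and batch cap are illustrative, not taken from any particular serving stack):

```python
import queue
import time

DEPARTURE_INTERVAL_S = 0.020   # ~20 ms: roughly one "drain" of the weights from HBM
MAX_BATCH = 64                 # illustrative capacity of one "train"

waiting = queue.Queue()        # requests queue here for the next departure

def run_forward_pass(batch):
    print(f"departing with {len(batch)} request(s)")

def dispatcher():
    while True:
        time.sleep(DEPARTURE_INTERVAL_S)              # the train runs on a fixed schedule
        batch = []
        while len(batch) < MAX_BATCH and not waiting.empty():
            batch.append(waiting.get_nowait())
        if batch:
            run_forward_pass(batch)                   # departs even when only partly full

# A request arriving just after a departure waits almost a full interval:
# its queueing latency is set by the timetable, not by the model.
```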
A "roofline analysis" reveals that LLM performance is limited by the slower of two factors: the time it takes to fetch model parameters from memory (memory-bound) or the time it takes to perform matrix multiplications (compute-bound). Optimizing performance requires identifying and addressing the correct bottleneck.
AI workloads are limited by memory bandwidth, not capacity. While commodity DRAM offers more bits per wafer, its bandwidth is more than an order of magnitude lower than HBM's. That speed gap would starve the GPU's compute cores, making the extra capacity useless and creating a massive performance bottleneck.
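Rough numbers make the point (the bandwidth and compute figures are assumed ballpark values, not any specific part's spec):

```python
def sustained_fraction_of_peak(peak_flops_per_s, mem_bytes_per_s, flops_per_byte=1.0):
    """Fraction of peak compute that stays busy when memory bandwidth is the limit."""
    achievable_flops_per_s = mem_bytes_per_s * flops_per_byte
    return min(1.0, achievable_flops_per_s / peak_flops_per_s)

PEAK = 1e15  # assumed ~1 PFLOP/s of low-precision compute
print(sustained_fraction_of_peak(PEAK, mem_bytes_per_s=3e12))    # HBM-class ~3 TB/s   -> ~0.3% busy
print(sustained_fraction_of_peak(PEAK, mem_bytes_per_s=0.2e12))  # DRAM-class ~200 GB/s -> ~0.02% busy
# Extra DRAM capacity doesn't help: the cores are already starved by bandwidth.
```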
Unlike simple classification (one pass per input), generative AI performs autoregressive inference: each new token (word, pixel) requires a full pass through the model, turning a single prompt into a long series of demanding computations. This makes inference a major, ongoing driver of GPU demand, rivaling training.
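A sketch of why one prompt turns into many forward passes (the model and its methods here are stand-ins, not a real API):

```python
def generate(model, prompt_tokens, max_new_tokens, eos_token):
    """Autoregressive decoding: every new token costs a full pass through the model."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)      # one full pass over all layers and weights
        next_token = int(logits.argmax())   # greedy choice for simplicity
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens

# A 500-token reply means ~500 full passes, which is why inference demand
# keeps growing long after a model has finished training.
```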
The necessity of batching stems from a fundamental hardware reality: moving data is far more energy-intensive than computing with it. A single parameter's trip from memory (HBM) to the on-chip multiplier can cost on the order of 1000x more energy than the multiplication itself. Batching amortizes this high data-movement cost over many computations.
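The amortization argument in one line (the picojoule figures are illustrative orders of magnitude, not measurements):

```python
def energy_per_useful_mac(batch_size, e_mult_pj=1.0, e_fetch_pj=1000.0):
    """One weight fetch feeds `batch_size` multiply-accumulates, so its cost is shared."""
    return e_mult_pj + e_fetch_pj / batch_size

print(energy_per_useful_mac(1))    # ~1001 pJ per MAC: dominated by data movement
print(energy_per_useful_mac(256))  # ~5 pJ per MAC: fetch cost almost fully amortized
```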
Top inference frameworks separate the prefill stage (ingesting the prompt, often compute-bound) from the decode stage (generating tokens, often memory-bound). This disaggregation allows for specialized hardware pools and scheduling for each phase, boosting overall efficiency and throughput.
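A toy version of the split (the worker names and KV-cache hand-off are schematic; production frameworks implement this with dedicated transports and schedulers):

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

def prefill_worker(request, model):
    """Compute-bound: ingest the whole prompt in one pass, producing the KV cache."""
    return model.prefill(request.prompt)               # stand-in call

def decode_worker(request, kv_cache, model):
    """Memory-bound: generate tokens one step at a time against the cached context."""
    out = []
    for _ in range(request.max_new_tokens):
        token, kv_cache = model.decode_step(kv_cache)  # stand-in call
        out.append(token)
    return out

# In a disaggregated deployment, prefill_worker and decode_worker run on separate
# hardware pools and the KV cache is shipped between them, so each pool can be
# sized, batched, and scheduled for the regime it is actually bound by.
```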
For any given hardware, there is a fundamental lower bound on inference latency. This "latency floor" is the time required to load the model's total parameters from memory (e.g., HBM) onto the chip. This process cannot be sped up by reducing batch size or other software tricks.
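The floor itself is a single division (the parameter count and bandwidth below are assumptions for illustration):

```python
def decode_latency_floor_s(n_params, bytes_per_param, hbm_bytes_per_s):
    """Minimum time per decoding step: every weight must cross from HBM onto the chip once."""
    return (n_params * bytes_per_param) / hbm_bytes_per_s

# e.g., a 70B-parameter model in fp16 served over ~3 TB/s of HBM bandwidth:
print(decode_latency_floor_s(70e9, 2, 3e12))  # ~0.047 s per step, regardless of batch size
```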
Traditional ML batched by normalizing inputs to the same size (static batching). LLMs break this model because input and output lengths vary. The core innovation is continuous batching: advancing every active request by one token per step and admitting or retiring requests between steps, which creates complex scheduling and memory challenges solved by techniques like PagedAttention.
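A stripped-down continuous-batching loop (the scheduling policy and names are invented for illustration; real schedulers also manage KV-cache memory, e.g., via PagedAttention):

```python
def serve(model, incoming, max_active=32):
    """Advance every active request by one token per step; admit and retire between steps."""
    active = []
    while incoming or active:
        # Admit new requests whenever a slot frees up -- no waiting for a batch to "finish".
        while incoming and len(active) < max_active:
            active.append(incoming.pop(0))

        # One decoding step for the whole batch, even though requests started at different times.
        next_tokens = model.decode_step([r["tokens"] for r in active])  # stand-in call
        for req, tok in zip(active, next_tokens):
            req["tokens"].append(tok)

        # Retire finished requests immediately instead of holding their slots.
        active = [r for r in active
                  if r["tokens"][-1] != r["eos"] and len(r["tokens"]) < r["limit"]]
```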
When splitting jobs across thousands of GPUs, inconsistent communication times (jitter) create bottlenecks, forcing the use of fewer GPUs. A network with predictable, uniform latency enables far greater parallelization and overall cluster efficiency, making it more important than raw 'hero number' bandwidth.
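A tiny simulation of why jitter hurts at scale (the latency distribution and scale factor are made up to show the shape of the effect):

```python
import random

def sync_step_ms(n_gpus, base_ms=1.0, jitter_scale_ms=0.0):
    """A synchronized collective finishes only when the slowest of n_gpus links finishes."""
    return max(base_ms + random.expovariate(1.0) * jitter_scale_ms for _ in range(n_gpus))

random.seed(0)
for n in (8, 512, 16384):
    avg = sum(sync_step_ms(n, jitter_scale_ms=0.5) for _ in range(50)) / 50
    print(f"{n:>6} GPUs: avg step {avg:.1f} ms")

# With zero jitter every scale would take 1.0 ms; with jitter, each step waits for the
# unluckiest link, and that tail penalty grows as more GPUs are yoked together.
```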
API providers offer faster inference at a premium by reducing the number of users processed simultaneously (batch size). This lowers latency but makes each token more expensive because the fixed cost of loading model weights is spread across fewer requests, reducing amortization.
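Why the "fast lane" costs more per token (the timings and GPU price are placeholders):

```python
def cost_per_token(batch_size, weight_load_time_s, per_request_compute_s, gpu_cost_per_s):
    """When decode is memory-bound, the weight-load time is paid once per step and shared by the batch."""
    step_time = weight_load_time_s + batch_size * per_request_compute_s
    return gpu_cost_per_s * step_time / batch_size

# Assumed numbers: 40 ms to stream the weights, small per-request compute, $3/hr GPU.
rate = 3 / 3600
print(cost_per_token(64, 0.040, 0.0001, rate))  # large batch: cheap per token
print(cost_per_token(4,  0.040, 0.0001, rate))  # small batch (low latency): ~10x pricier per token
```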
Unlike traditional computing, where inputs were standardized, LLMs handle requests of varying length and produce outputs of unpredictable length. This unpredictability creates massive scheduling and memory-management challenges on GPUs that were not designed for such chaotic, real-time workloads.