For any given piece of hardware, there is a fundamental lower bound on inference latency. This "latency floor" is the time required to load all of the model's parameters from memory (e.g., HBM) onto the chip, and it cannot be lowered by reducing batch size or by any other software trick.
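
A quick sanity check of the floor, as a sketch in Python (the parameter count, precision, and bandwidth below are assumed figures, roughly a 70B model on an H100-class GPU):

```python
def latency_floor_ms(num_params: float, bytes_per_param: int,
                     hbm_bandwidth_gbps: float) -> float:
    """Lower bound on per-token decode latency: one full pass over the weights."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / (hbm_bandwidth_gbps * 1e9) * 1e3  # milliseconds

# Assumed: 70B parameters in bf16 (2 bytes each), ~3.35 TB/s of HBM bandwidth.
print(f"{latency_floor_ms(70e9, 2, 3350):.1f} ms per token")  # ≈ 41.8 ms
```

Only faster memory or fewer bytes per parameter (e.g., quantization) moves this floor.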

Related Insights

A "roofline analysis" reveals that LLM performance is limited by the slower of two factors: the time it takes to fetch model parameters from memory (memory-bound) or the time it takes to perform matrix multiplications (compute-bound). Optimizing performance requires identifying and addressing the correct bottleneck.
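
The slower-factor rule can be sketched in a few lines (the peak compute and bandwidth numbers are assumptions, not from the source):

```python
def step_time_s(flops: float, bytes_moved: float,
                peak_flops: float, peak_bw: float) -> float:
    """Roofline estimate: a step takes as long as its slower bottleneck."""
    t_compute = flops / peak_flops    # time if purely compute-bound
    t_memory = bytes_moved / peak_bw  # time if purely memory-bound
    return max(t_compute, t_memory)

# Batch-1 decode on an assumed 70B bf16 model (~2 FLOPs per parameter,
# ~140 GB of weights) with assumed peaks of 1 PFLOP/s and 3.35 TB/s:
t = step_time_s(flops=2 * 70e9, bytes_moved=140e9,
                peak_flops=1e15, peak_bw=3.35e12)
print(f"{t * 1e3:.1f} ms")  # memory-bound: ≈ 41.8 ms vs ~0.14 ms of compute
```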

AI workloads are limited by memory bandwidth, not capacity. While commodity DRAM offers more bits per wafer, its bandwidth is over an order of magnitude lower than specialized HBM. This speed difference would starve the GPU's compute cores, making the extra capacity useless and creating a massive performance bottleneck.
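
To see the starvation effect, compare the time to stream one set of weights at each speed (the model size and both bandwidth figures below are assumed, illustrative values):

```python
WEIGHT_GB = 140    # assumed: a 70B-parameter model in bf16
HBM_GBPS = 3350    # assumed HBM bandwidth
DRAM_GBPS = 100    # assumed commodity-DRAM bandwidth

print(f"HBM:  {WEIGHT_GB / HBM_GBPS * 1e3:.0f} ms per weight pass")   # ≈ 42 ms
print(f"DRAM: {WEIGHT_GB / DRAM_GBPS * 1e3:.0f} ms per weight pass")  # 1400 ms
```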

The necessity of batching stems from a fundamental hardware reality: moving data is far more energy-intensive than computing with it. A single parameter's journey from on-chip SRAM to the multiplier can cost 1000x more energy than the multiplication itself. Batching amortizes this high data movement cost over many computations.
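
The amortization math, as a toy model (both energy figures are assumptions for illustration, not measured values):

```python
MOVE_PJ = 1000.0  # assumed: energy to move one weight to the multiplier
MAC_PJ = 1.0      # assumed: energy for one multiply-accumulate

def energy_per_mac_pj(batch_size: int) -> float:
    """Average energy per useful multiply once the move is shared by a batch."""
    return MOVE_PJ / batch_size + MAC_PJ

print(energy_per_mac_pj(1))    # 1001.0 pJ: movement dominates
print(energy_per_mac_pj(256))  # ≈ 4.9 pJ: mostly compute
```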

API providers charge less for input tokens (prefill) than for generated tokens (decode). Prefill processes many tokens at once and is therefore compute-bound; decode generates tokens one at a time, so each step is dominated by the high, unamortized cost of memory access. The price difference reflects this efficiency gap.
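
The gap comes down to arithmetic intensity: FLOPs performed per byte of weights loaded. A sketch, assuming ~2 FLOPs per parameter per token and 2-byte weights:

```python
def flops_per_weight_byte(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """Arithmetic intensity of one pass over the weights."""
    return 2 * tokens_per_pass / bytes_per_param  # ~2 FLOPs per param per token

print(flops_per_weight_byte(2048))  # prefill, 2048-token prompt: 2048.0
print(flops_per_weight_byte(1))     # decode, one token at a time: 1.0
```

A GPU whose compute-to-bandwidth ratio is in the hundreds of FLOPs per byte is saturated by prefill but largely idle during batch-1 decode.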

Spreading a model's layers across multiple GPU racks (pipeline parallelism) is a strategy to overcome memory capacity limits on a single rack. However, for inference, it offers no latency improvement; the total time remains the same. Its sole benefit is in memory capacity management for enormous models.
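
Why latency is unchanged, in a sketch (the bandwidth figure is assumed and inter-stage transfer time is ignored):

```python
def pipeline_latency_ms(total_weight_bytes: float, n_stages: int,
                        bw_per_device: float = 3.35e12) -> float:
    """Per-token latency: a token visits every pipeline stage in sequence."""
    stage_ms = (total_weight_bytes / n_stages) / bw_per_device * 1e3
    return stage_ms * n_stages  # the 1/n speedup per stage cancels out

print(pipeline_latency_ms(140e9, 1))  # one device:   ≈ 41.8 ms
print(pipeline_latency_ms(140e9, 8))  # eight stages: ≈ 41.8 ms, no change
```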

At shorter context lengths, LLM cost is dominated by compute. As context grows, fetching the KV cache from memory becomes the bottleneck. A pricing tier that increases cost above a certain context length (e.g., 200k tokens) indicates the approximate point where the system becomes memory-bandwidth limited and thus less efficient.
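
A sketch of why long contexts flip the bottleneck: per-step traffic is a fixed weight read plus a KV cache that grows linearly with context (all model dimensions below are assumptions, roughly a 70B-class model):

```python
def decode_bytes_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                    head_dim: int = 128, weight_gb: float = 140.0,
                    bytes_per_elem: int = 2) -> float:
    """GB moved per decode step: weights plus the full KV cache (keys and values)."""
    kv_gb = (context_len * n_layers * 2 * n_kv_heads * head_dim
             * bytes_per_elem) / 1e9
    return weight_gb + kv_gb

for ctx in (8_000, 200_000, 1_000_000):
    print(f"{ctx:>9} tokens: {decode_bytes_gb(ctx):.0f} GB per step")
```

Note that, unlike weights, the KV cache is per-request and cannot be amortized across a batch, which makes long-context decode memory-bound even sooner in practice.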

The ideal batch size that balances memory-bound and compute-bound operations can be estimated with a simple formula: roughly 300 (a hardware constant for modern GPUs) multiplied by the model's sparsity (total parameters / active parameters). This provides a practical starting point for performance optimization.
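
The formula as stated, in code (the constant 300 is the text's rule of thumb, and the example models are hypothetical):

```python
def ideal_batch_size(total_params: float, active_params: float,
                     hw_constant: float = 300.0) -> int:
    """Rule-of-thumb batch size: hardware constant times sparsity."""
    sparsity = total_params / active_params
    return round(hw_constant * sparsity)

print(ideal_batch_size(70e9, 70e9))   # dense model, sparsity 1: 300
print(ideal_batch_size(640e9, 80e9))  # MoE with 1/8 active: 2400
```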

While many focus on compute metrics like FLOPS, the primary bottleneck for large AI models is memory bandwidth, the speed at which weights can be loaded into the GPU's compute units. This single metric is a better indicator of real-world performance from one GPU generation to the next than raw compute power.

API providers offer faster inference at a premium by reducing the number of users processed simultaneously (batch size). This lowers latency but makes each token more expensive because the fixed cost of loading model weights is spread across fewer requests, reducing amortization.
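
A toy cost model of this trade-off (all cost units are made up for illustration):

```python
def cost_per_token(batch_size: int, weight_load_cost: float = 1.0,
                   compute_cost: float = 0.01) -> float:
    """Fixed weight-load cost shared across the batch, plus per-token compute."""
    return weight_load_cost / batch_size + compute_cost

print(cost_per_token(256))  # throughput tier:  ≈ 0.014 per token
print(cost_per_token(8))    # low-latency tier: 0.135 per token, ~10x pricier
```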

High-throughput GPU clusters process batches on a fixed interval (e.g., every 20ms), like a train schedule. This interval is determined by the time it takes to "drain" the GPU's HBM. Requests are queued for the next departure, and the train leaves even if not full, which determines queueing latency.
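
The train-schedule analogy can be simulated directly (the 20 ms interval is the text's example; uniformly random arrivals are an assumption):

```python
import random

INTERVAL_MS = 20.0  # assumed fixed batch departure interval

def queue_wait_ms(arrival_ms: float) -> float:
    """Time from a request's arrival until the next scheduled batch departure."""
    return INTERVAL_MS - (arrival_ms % INTERVAL_MS)

waits = [queue_wait_ms(random.uniform(0, 1_000)) for _ in range(100_000)]
print(f"mean wait ≈ {sum(waits) / len(waits):.1f} ms, worst case {INTERVAL_MS} ms")
```

With uniform arrivals the expected queueing latency is half the interval, about 10 ms here, with a worst case of one full interval.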