While reinforcement learning (RL) improves model capabilities, it often results in unpredictable, "bursty" computational demands during inference. This makes the model hard to serve efficiently: infrastructure must be provisioned for costly peak loads, and much of that capacity sits underused between bursts.

Related Insights

The "Bitter Lesson" is not just about using more compute, but leveraging it scalably. Current LLMs are inefficient because they only learn during a discrete training phase, not during deployment where most computation occurs. This reliance on a special, data-intensive training period is not a scalable use of computational resources.

Unlike simple classification, which needs a single pass, generative AI performs recursive (autoregressive) inference: each new token (word, pixel) requires a full pass through the model, and that token is then fed back in to produce the next one. A single prompt therefore becomes a series of demanding computations, which makes inference a major, ongoing driver of GPU demand, rivaling training.
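To make the one-pass versus many-pass contrast concrete, here is a minimal sketch of the per-token loop. The `toy_forward` function is a made-up stand-in for a real model's forward pass, and `classify` and `generate` are hypothetical names; the only point is to count how many passes each style of inference needs.

```python
def toy_forward(tokens):
    """Hypothetical stand-in for one forward pass; returns a fake token id."""
    return (sum(tokens) + len(tokens)) % 50


def classify(tokens):
    """Classification: one pass over the input, one answer."""
    passes = 1
    label = toy_forward(tokens) % 2
    return label, passes


def generate(prompt, max_new_tokens, eos_id=None):
    """Generation: one pass per new token, each output fed back as input."""
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward(tokens)  # a pass through the whole model
        passes += 1
        tokens.append(next_token)         # the output becomes part of the input
        if next_token == eos_id:          # stop early on an end-of-sequence id
            break
    return tokens, passes


prompt = [1, 2, 3]
_, classify_passes = classify(prompt)
output, generate_passes = generate(prompt, max_new_tokens=10)
print(classify_passes, generate_passes)
```

Real serving stacks typically cache attention keys and values so that each step only processes the newest token, but the pass through every layer is still repeated once per generated token.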

The compute power required for AI agents to operate ('inference') is a significant new cost. Without optimized infrastructure to manage that cost, companies risk spending all their AI-driven productivity gains on 'feeding' their digital workers, making the initiative unprofitable.

While GPUs train models, CPUs are essential for two key workloads: running reinforcement learning environments and executing the code generated by AI. This has created a massive, often overlooked demand spike, making CPUs a critical, sold-out component in the AI infrastructure stack and a hidden bottleneck.

AI workloads, particularly for research and evals, don't follow predictable "follow-the-sun" patterns. They are extremely spiky, demanding massive compute resources instantly (e.g., 100,000 CPUs) and then dropping to zero. This forces providers like Daytona to maintain low mean utilization (15%) to handle unpredictable peaks.
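The link between spiky demand and low mean utilization is simple arithmetic: if capacity is sized for the peak, mean utilization is average demand divided by that capacity. The sketch below uses a made-up hourly demand trace, chosen only so the result lands on the 15% figure quoted above; it is not Daytona's data, and `mean_utilization` is a hypothetical helper.

```python
def mean_utilization(demand, capacity):
    """Average fraction of provisioned capacity that is actually in use."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return sum(demand) / (len(demand) * capacity)


# Hypothetical 20-hour trace: idle for 17 hours, then 100,000 CPUs for 3 hours.
demand = [0] * 17 + [100_000] * 3

capacity = max(demand)  # provision for the peak so no burst is refused
utilization = mean_utilization(demand, capacity)
print(f"{utilization:.0%}")  # 15% for this made-up trace
```

The same fleet serving a flat workload with the same total demand would need only 15,000 CPUs, which is why the shape of the demand matters more than its average.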

RL models can be inefficient during inference. The GPU often sits idle while the CPU calculates rewards, then suddenly gets hit with a massive "burst" of activity. This unpredictable demand makes serving these models costly and complex, requiring conservative GPU allocation.
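A rough way to see the idle time: assume, as a simplification, that GPU work and CPU reward calculation strictly alternate with no overlap. The GPU's busy fraction is then its own time divided by the total cycle time. The function name and the timings below are hypothetical and chosen only for illustration.

```python
def gpu_busy_fraction(gpu_seconds, cpu_reward_seconds):
    """Share of each cycle the GPU is working, assuming strictly serial phases."""
    cycle = gpu_seconds + cpu_reward_seconds
    if cycle <= 0:
        raise ValueError("cycle time must be positive")
    return gpu_seconds / cycle


# Hypothetical cycle: a 2-second GPU burst, then 6 seconds of CPU reward scoring.
busy = gpu_busy_fraction(gpu_seconds=2.0, cpu_reward_seconds=6.0)
print(f"GPU busy {busy:.0%} of the time")
```

Under these assumptions the GPU must still be sized for the burst even though it is idle for most of each cycle. Overlapping the two phases would raise the busy fraction, at the cost of more complex scheduling.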

Unlike pre-training's simpler data pipeline, RL involves many "moving parts" because each task can have a unique grading setup and infrastructure. This complexity, not just the algorithm itself, is the primary challenge for researchers managing live training runs at scale.

Contrary to the idea that infrastructure problems get commoditized, AI inference is growing more complex. This is driven by three factors: (1) increasing model scale (multi-trillion parameters), (2) greater diversity in model architectures and hardware, and (3) the shift to agentic systems that require managing long-lived, unpredictable state.

Previously, the biggest constraint in AI was compute for training next-gen models. Now, the critical bottleneck is providing enough compute for *inference*—the real-time processing of queries from a rapidly growing user base.

Unlike traditional computing, where inputs were standardized, LLMs handle requests of varying lengths and produce outputs whose length cannot be known in advance. This unpredictability creates massive scheduling and memory management challenges on GPUs that were not designed for such chaotic, real-time workloads.
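One small illustration of why varying lengths hurt: under naive static batching, where every request in a batch is padded to the longest one, the padding slots consume compute and memory without doing useful work. The sketch below measures that waste; the request lengths are made up, `padding_waste` is a hypothetical helper, and padding to the longest sequence is only one simple batching strategy, not a description of any particular serving system.

```python
def padding_waste(lengths):
    """Fraction of token slots that are padding when a batch is padded to its longest request."""
    if not lengths or max(lengths) <= 0:
        raise ValueError("need at least one request with positive length")
    padded_slots = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_slots


# Hypothetical batch of four requests with very different token counts.
mixed = [10, 50, 200, 1000]
uniform = [1000, 1000, 1000, 1000]

print(f"mixed lengths:   {padding_waste(mixed):.1%} of slots are padding")
print(f"uniform lengths: {padding_waste(uniform):.1%} of slots are padding")
```

With standardized inputs the waste is zero. With mixed lengths most of the batch is padding, and because output lengths are not known up front, the scheduler cannot simply sort requests to avoid it.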