Contrary to the idea that infrastructure problems get commoditized, AI inference is growing more complex. This is driven by three factors: (1) increasing model scale (multi-trillion parameters), (2) greater diversity in model architectures and hardware, and (3) the shift to agentic systems that require managing long-lived, unpredictable state.
A 10x increase in compute may yield only a one-tier improvement in model performance. That looks inefficient, but it can be the difference between a useless "6-year-old" intelligence and a highly valuable "16-year-old" intelligence, unlocking entirely new economic applications.
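A rough way to see why returns per unit of compute look so thin is the power-law form of empirical scaling laws. The sketch below is illustrative only; the exponent and constants are assumed for the example, not figures from the source.

```latex
% Illustrative compute scaling law; the exponent alpha is assumed, not measured.
% Empirical scaling laws fit loss as a power law in training compute C, so a
% 10x increase in C shrinks the reducible loss by only ~11% when alpha = 0.05:
\[
  L(C) \;=\; L_{\infty} + \left(\frac{C_0}{C}\right)^{\alpha},
  \qquad
  \frac{L(10C) - L_{\infty}}{L(C) - L_{\infty}} \;=\; 10^{-\alpha} \;\approx\; 0.89
  \quad \text{for } \alpha = 0.05 .
\]
```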
The original playbook of simply scaling parameters and data is now obsolete. Top AI labs have pivoted to heavily engineered post-training pipelines, retrieval, tool use, and agent training, acknowledging that raw scaling is insufficient to solve real-world problems.
Unlike simple classification (one pass), generative AI performs recursive inference. Each new token (word, pixel) requires a full pass through the model, turning a single prompt into a series of demanding computations. This makes inference a major, ongoing driver of GPU demand, rivaling training.
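As a minimal sketch of that recursion, the loop below spends one full forward pass per generated token. The `model` callable, `greedy_sample`, and the token representation are hypothetical stand-ins for a real inference stack, which would also reuse a KV cache and batch requests.

```python
def greedy_sample(logits):
    # Greedy decoding: pick the highest-scoring token id.
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(model, prompt_tokens, max_new_tokens, eos_token):
    """Autoregressive decoding: each new token costs one full model pass."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)              # full forward pass over the sequence so far
        next_token = greedy_sample(logits)  # choose the next token
        tokens.append(next_token)
        if next_token == eos_token:
            break                           # output length is not known in advance
    return tokens[len(prompt_tokens):]
```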
The future of AI is hard to predict because increasing a model's scale often produces 'emergent properties'—new capabilities that were not designed or anticipated. This means even experts are often surprised by what new, larger models can do, making the development path non-linear.
Unlike traditional SaaS, achieving product-market fit in AI is not enough for survival. The high and variable costs of model inference mean that as usage grows, companies can scale directly into unprofitability. This makes developing cost-efficient infrastructure a critical moat and survival strategy, not just an optimization.
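A toy unit-economics calculation makes the point. Every number below is hypothetical and chosen only to show how a flat-priced product can flip to negative margin as per-user token consumption grows.

```python
# Hypothetical unit economics: a flat-priced subscription whose inference cost
# grows with per-user token consumption. All numbers are illustrative.

def monthly_margin(price_per_user, tokens_per_user, cost_per_million_tokens):
    inference_cost = tokens_per_user / 1_000_000 * cost_per_million_tokens
    return price_per_user - inference_cost

# Light users are profitable; heavy (e.g. agentic) users flip the margin negative.
print(monthly_margin(20.0, 2_000_000, 5.0))    # 20 - 10 = +10.0
print(monthly_margin(20.0, 10_000_000, 5.0))   # 20 - 50 = -30.0
```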
The era of guaranteed progress by simply scaling up compute and data for pre-training is ending. With massive compute now available, the bottleneck is no longer resources but fundamental ideas. The AI field is re-entering a period where novel research, not just scaling existing recipes, will drive the next breakthroughs.
While the cost to achieve a fixed capability level (e.g., GPT-4 at launch) has dropped over 100x, overall enterprise spending is increasing. This paradox is explained by powerful multipliers: demand for frontier models, longer reasoning chains, and multi-step agentic workflows that consume orders of magnitude more tokens.
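The arithmetic below illustrates that paradox with assumed multipliers, not measured figures: even a 100x drop in per-token cost can be overwhelmed by reasoning and agentic token growth.

```python
# Illustrative (not measured) arithmetic for the cost paradox: per-token prices
# fall sharply, but reasoning chains and agentic loops multiply tokens per task.

price_drop = 1 / 100          # cost per token falls ~100x
reasoning_multiplier = 20     # longer chains of thought per request (assumed)
agent_steps = 15              # multi-step agent loop per task (assumed)

spend_change = price_drop * reasoning_multiplier * agent_steps
print(spend_change)           # 3.0 -> spend per task still rises ~3x
```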
AI's computational needs do not stop at initial training. They compound through post-training (reinforcement learning) and inference (multi-step reasoning), creating a much larger demand profile than previously understood and driving a roughly billion-fold increase in compute.
Unlike traditional computing workloads with standardized inputs, LLMs handle requests of widely varying lengths and produce outputs of non-deterministic duration. This variability creates severe scheduling and memory-management challenges on GPUs, which were not designed for such irregular, real-time workloads.
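One common answer is iteration-level (continuous) batching, sketched below in simplified form. The class, the flat token-count memory model, and the `model_step` callback are hypothetical; production schedulers such as vLLM add paged KV caches and preemption on top of this pattern.

```python
# Simplified sketch of continuous (iteration-level) batching for variable-length
# requests. Requests are plain dicts with a "tokens" count; real schedulers also
# page the KV cache and preempt requests under memory pressure.
from collections import deque

class Scheduler:
    def __init__(self, kv_budget_tokens):
        self.kv_budget = kv_budget_tokens   # total KV-cache capacity in tokens
        self.waiting = deque()              # requests not yet admitted
        self.running = []                   # requests currently decoding

    def submit(self, request):
        self.waiting.append(request)

    def step(self, model_step):
        # Admit waiting requests while their prompts fit in the KV budget.
        used = sum(r["tokens"] for r in self.running)
        while self.waiting and used + self.waiting[0]["tokens"] <= self.kv_budget:
            req = self.waiting.popleft()
            used += req["tokens"]
            self.running.append(req)

        # One decode iteration: every running request grows by one token,
        # but no one knows in advance which request will finish this step.
        still_running, finished = [], []
        for req in self.running:
            req["tokens"] += 1
            (finished if model_step(req) else still_running).append(req)
        self.running = still_running
        return finished
```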