Unlike simple classification (one pass), generative AI performs recursive inference. Each new token (word, pixel) requires a full pass through the model, turning a single prompt into a series of demanding computations. This makes inference a major, ongoing driver of GPU demand, rivaling training.
The progression from early neural networks to today's massive models is fundamentally driven by the exponential increase in available computational power, from the initial move to GPUs to today's million-fold increases in training capacity on a single model.
The computational requirements for generative media scale dramatically across modalities. If a 200-token LLM prompt costs 1 unit of compute, a single image costs 100x that, and a 5-second video costs another 100x on top of that—a 10,000x total increase. 4K video adds another 10x multiplier.
Today's AI is largely text-based (LLMs). The next phase involves Visual Language Models (VLMs) that interpret and interact with the physical world for robotics and surgery. This transition requires an exponential, 50-1000x increase in compute power, underwriting the long-term AI infrastructure build-out.
OpenAI favors "zero gradient" prompt optimization because serving thousands of unique, fine-tuned model snapshots is operationally very difficult. Prompt-based adjustments allow performance gains without the immense infrastructure burden, making it a more practical and scalable approach for both OpenAI and developers.
Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects target hardware and then chooses model parameters—such as a hidden dimension with many powers of two—to align with how GPUs split up workloads, maximizing efficiency from day one.
Traditional video models process an entire clip at once, causing delays. Descartes' Mirage model is autoregressive, predicting only the next frame based on the input stream and previously generated frames. This LLM-like approach is what enables its real-time, low-latency performance.
AI progress was expected to stall in 2024-2025 due to hardware limitations on pre-training scaling laws. However, breakthroughs in post-training techniques like reasoning and test-time compute provided a new vector for improvement, bridging the gap until next-generation chips like NVIDIA's Blackwell arrived.
While GenAI continues the "learn by example" paradigm of machine learning, its ability to create novel content like images and language is a fundamental step-change. It moves beyond simply predicting patterns to generating entirely new outputs, representing a significant evolution in computing.
AI's computational needs are not just from initial training. They compound exponentially due to post-training (reinforcement learning) and inference (multi-step reasoning), creating a much larger demand profile than previously understood and driving a billion-X increase in compute.
The primary performance bottleneck for LLMs is memory bandwidth (moving large weights), making them memory-bound. In contrast, diffusion-based video models are compute-bound, as they saturate the GPU's processing power by simultaneously denoising tens of thousands of tokens. This represents a fundamental difference in optimization strategy.