The "Attention is All You Need" paper's key breakthrough was an architecture designed for massive scalability across GPUs. This focus on efficiency, anticipating the industry's shift to larger models, was more crucial to its dominance than the attention mechanism itself.
While more data and compute yield linear improvements, true step-function advances in AI come from unpredictable algorithmic breakthroughs like Transformers. These creative ideas are the most difficult to innovate on and represent the highest-leverage, yet riskiest, area for investment and research focus.
The performance gains from Nvidia's Hopper to Blackwell GPUs come from increased size and power, not efficiency. This signals a potential scaling limit, creating an opportunity for radically new hardware primitives and neural network architectures beyond today's matrix-multiplication-centric models.
The progress in deep learning, from AlexNet's GPU leap to today's massive models, is best understood as a history of scaling compute. This scaling, resulting in a million-fold increase in power, enabled the transition from text to more data-intensive modalities like vision and spatial intelligence.
The progression from early neural networks to today's massive models is fundamentally driven by the exponential increase in available computational power, from the initial move to GPUs to today's million-fold increases in training capacity on a single model.
The plateauing performance-per-watt of GPUs suggests that simply scaling current matrix multiplication-heavy architectures is unsustainable. This hardware limitation may necessitate research into new computational primitives and neural network designs built for large-scale distributed systems, not single devices.
The history of AI, such as the 2012 AlexNet breakthrough, demonstrates that scaling compute and data on simpler, older algorithms often yields greater advances than designing intricate new ones. This "bitter lesson" suggests prioritizing scalability over algorithmic complexity for future progress.
The "bitter lesson" in AI research posits that methods leveraging massive computation scale better and ultimately win out over approaches that rely on human-designed domain knowledge or clever shortcuts, favoring scale over ingenuity.
Top-tier kernels like FlashAttention are co-designed with specific hardware (e.g., H100). This tight coupling makes waiting for future GPUs an impractical strategy. The competitive edge comes from maximizing the performance of available hardware now, even if it means rewriting kernels for each new generation.
Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects target hardware and then chooses model parameters—such as a hidden dimension with many powers of two—to align with how GPUs split up workloads, maximizing efficiency from day one.
Today's transformers are optimized for matrix multiplication (MatMul) on GPUs. However, as compute scales to distributed clusters, MatMul may not be the most efficient primitive. Future AI architectures could be drastically different, built on new primitives better suited for large-scale, distributed hardware.