Top-tier kernels like FlashAttention are co-designed with specific hardware (e.g., H100). This tight coupling makes waiting for future GPUs an impractical strategy. The competitive edge comes from maximizing the performance of available hardware now, even if it means rewriting kernels for each new generation.

Related Insights

The performance gains from NVIDIA's Hopper to Blackwell GPUs come largely from bigger dies and higher power draw, not improved efficiency. This signals a potential scaling limit, creating an opportunity for radically new hardware primitives and neural network architectures beyond today's matrix-multiplication-centric models.

The MI300X's higher memory bandwidth and 192 GB of HBM make it faster than the H100 for non-FP8 dense transformer and MoE workloads. Quentin Anthony of Zyphra notes that AMD's software stack has caught up, creating an under-appreciated arbitrage opportunity for teams willing to build on it.
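
A rough way to see why bandwidth and capacity dominate here: in batch-1 decoding, every generated token must stream the full weight set from HBM, so tokens per second are bounded by bandwidth divided by model size. The sketch below is a back-of-envelope estimate, not a benchmark; the spec numbers are approximate public figures (about 3.35 TB/s for an H100 SXM, about 5.3 TB/s for an MI300X), and the 70B-parameter fp16 model is a hypothetical example.

```python
# Back-of-envelope estimate of memory-bound decode throughput.
# Assumes batch-1 autoregressive decoding, where each token streams all
# model weights from HBM once. Spec figures are approximate public numbers.
TB = 1e12

gpus = {
    "H100 SXM": {"hbm_gb": 80,  "bandwidth_tb_s": 3.35},
    "MI300X":   {"hbm_gb": 192, "bandwidth_tb_s": 5.3},
}

def decode_tokens_per_s(params_billion: float, bytes_per_param: float, bw_tb_s: float) -> float:
    """Upper bound on tokens/s when decoding is limited purely by weight traffic."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return (bw_tb_s * TB) / bytes_per_token

for name, spec in gpus.items():
    tps = decode_tokens_per_s(params_billion=70, bytes_per_param=2, bw_tb_s=spec["bandwidth_tb_s"])
    fits = 70 * 2 <= spec["hbm_gb"]  # 70B params at 2 bytes each ~= 140 GB of weights
    print(f"{name}: ~{tps:.0f} tok/s ceiling, fits 70B fp16 on one GPU: {fits}")
```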

The plateauing performance per watt of GPUs suggests that simply scaling today's matrix-multiplication-heavy architectures is unsustainable. This hardware limitation may necessitate research into new computational primitives and neural network designs built for large-scale distributed systems, not single devices.

The history of AI, such as the 2012 AlexNet breakthrough, demonstrates that scaling compute and data on simpler, older algorithms often yields greater advances than designing intricate new ones. This "bitter lesson" suggests prioritizing scalability over algorithmic complexity for future progress.

In the fast-paced world of AI, focusing only on the limitations of current models is a failing strategy. GitHub's CPO advises product teams to design for the future capabilities they anticipate. This ensures that when a more powerful model drops, the product experience can be rapidly upgraded to its full potential.

Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects target hardware and then chooses model parameters, such as a hidden dimension divisible by large powers of two, to align with how GPUs tile and split up the workload, maximizing efficiency from day one.
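
A minimal sketch of the kind of check this implies, assuming the relevant constraints are GEMM tile multiples and even division across tensor-parallel ranks; the specific values and thresholds are illustrative, not Zyphra's actual criteria.

```python
# Illustrative check that a candidate hidden dimension divides cleanly
# across the ways a GPU (and a tensor-parallel setup) will slice it.
# The constraints below are generic examples, not Zyphra's actual rules.

def largest_power_of_two_factor(n: int) -> int:
    return n & -n  # lowest set bit == largest power-of-two divisor

def check_hidden_dim(hidden: int, num_heads: int, tp_degree: int, tile: int = 64) -> dict:
    head_dim = hidden / num_heads
    return {
        "hidden_dim": hidden,
        "pow2_factor": largest_power_of_two_factor(hidden),
        "head_dim_integer": head_dim.is_integer(),
        "divisible_by_tile": hidden % tile == 0,               # GEMM-tile friendly
        "divisible_by_tp": hidden % (tp_degree * tile) == 0,   # splits evenly across TP ranks
    }

for candidate in (4096, 5120, 6144, 7000):
    print(check_hidden_dim(candidate, num_heads=32, tp_degree=8))
```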

AI progress was expected to stall in 2024-2025 due to hardware limitations on pre-training scaling laws. However, breakthroughs in post-training techniques like reasoning and test-time compute provided a new vector for improvement, bridging the gap until next-generation chips like NVIDIA's Blackwell arrived.

Fal maintains a performance edge by building a specialized just-in-time (JIT) compiler for diffusion models. This verticalized approach, inspired by PyTorch 2.0 but more focused, generates more efficient kernels than generalized tools, creating a defensible technical moat.
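
Fal's compiler is proprietary, but the general-purpose PyTorch 2.0 path it is contrasted with looks roughly like the sketch below. The toy module is a stand-in for a real diffusion UNet block, and the compile settings are illustrative only.

```python
# General-purpose PyTorch 2.0 JIT compilation, the baseline a specialized
# diffusion compiler tries to beat. TinyBlock is a toy stand-in for a UNet block.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for one residual block of a diffusion UNet."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, channels)

    def forward(self, x):
        return x + torch.nn.functional.silu(self.norm(self.conv(x)))

model = TinyBlock()
# torch.compile traces the model and lowers it to fused kernels via TorchInductor.
# A vertical compiler can go further by baking in fixed shapes, schedulers, and
# sampler loops that a general-purpose tool must treat as dynamic.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 64, 32, 32)
print(compiled(x).shape)
```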

Instead of relying on high-level compilers like Triton, elite programmers design algorithms around the properties of specific hardware (e.g., AMD's MI300X). This bottom-up approach ensures the code fully exploits the hardware's strengths, a level of control that higher-level abstractions often give up.
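
For reference, this is the abstraction level Triton offers: the standard vector-add kernel below lets the programmer pick the block size and grid, while register allocation, memory coalescing, and instruction selection are left to the compiler, which is exactly the control being traded away. It requires a CUDA or ROCm GPU with Triton installed.

```python
# A standard Triton vector-add kernel, shown to illustrate the abstraction
# level being discussed: tiling is explicit, but lower-level scheduling and
# instruction selection are handled by the compiler.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
print(torch.allclose(add(x, y), x + y))
```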

Today's transformers are optimized for matrix multiplication (MatMul) on GPUs. However, as compute scales to distributed clusters, MatMul may not be the most efficient primitive. Future AI architectures could be drastically different, built on new primitives better suited for large-scale, distributed hardware.
