The exponential growth in AI required moving beyond single GPUs. Mellanox's interconnect technology was critical for scaling to thousands of GPUs, effectively turning the entire data center into a single, high-performance computer and solving the post-Moore's Law scaling challenge.
The performance gains from Nvidia's Hopper to Blackwell GPUs come from increased size and power, not efficiency. This signals a potential scaling limit, creating an opportunity for radically new hardware primitives and neural network architectures beyond today's matrix-multiplication-centric models.
By funding and backstopping CoreWeave, which exclusively uses its GPUs, NVIDIA establishes its hardware as the default for the AI cloud. This gives NVIDIA leverage over major customers like Microsoft and Amazon, who are developing their own chips. It makes switching to proprietary silicon more difficult, creating a competitive moat based on market structure, not just technology.
The progress in deep learning, from AlexNet's GPU leap to today's massive models, is best understood as a history of scaling compute. This scaling, resulting in a million-fold increase in power, enabled the transition from text to more data-intensive modalities like vision and spatial intelligence.
The progression from early neural networks to today's massive models is fundamentally driven by the exponential increase in available computational power, from the initial move to GPUs to today's million-fold increases in training capacity on a single model.
While known for its GPUs, NVIDIA's true competitive moat is CUDA, a free software platform that made its hardware accessible for diverse applications like research and AI. This created a powerful network effect and stickiness that competitors struggled to replicate, making NVIDIA more of a software company than observers realize.
The plateauing performance-per-watt of GPUs suggests that simply scaling current matrix multiplication-heavy architectures is unsustainable. This hardware limitation may necessitate research into new computational primitives and neural network designs built for large-scale distributed systems, not single devices.
While AI inference can be decentralized, training the most powerful models demands extreme centralization of compute. The necessity for high-bandwidth, low-latency communication between GPUs means the best models are trained by concentrating hardware in the smallest possible physical space, a direct contradiction to decentralized ideals.
When splitting jobs across thousands of GPUs, inconsistent communication times (jitter) create bottlenecks, forcing the use of fewer GPUs. A network with predictable, uniform latency enables far greater parallelization and overall cluster efficiency, making it more important than raw 'hero number' bandwidth.
AI's computational needs are not just from initial training. They compound exponentially due to post-training (reinforcement learning) and inference (multi-step reasoning), creating a much larger demand profile than previously understood and driving a billion-X increase in compute.
The fundamental unit of AI compute has evolved from a silicon chip to a complete, rack-sized system. According to Nvidia's CTO, a single 'GPU' is now an integrated machine that requires a forklift to move, a crucial mindset shift for understanding modern AI infrastructure scale.