Nvidia CTO Michael Kagan: Scaling Beyond Moore's Law to Million-GPU Clusters

Training Data · Oct 28, 2025

Nvidia CTO Michael Kagan on how Mellanox's interconnect tech was pivotal in shifting from Moore's Law to data-center-scale AI computing.

In Massive GPU Clusters, the Probability of All Components Working is Zero; Design for Failure

When building systems with hundreds of thousands of GPUs and millions of components, it's a statistical certainty that something is always broken. Therefore, hardware and software must be architected from the ground up to handle constant, inevitable failures while maintaining performance and service availability.
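The arithmetic behind this claim is worth making concrete. A minimal sketch (my own illustration, not Nvidia's analysis): if each of N independent components is healthy with probability p, the whole cluster is fully healthy with probability p^N, which collapses toward zero as N reaches the millions.

```python
import math

def prob_all_working(p: float, n: int) -> float:
    """Probability that every one of n independent components is healthy.

    Computed as exp(n * ln(p)), a numerically stable form of p**n.
    """
    return math.exp(n * math.log(p))

# Even with five-nines reliability per component (p = 0.99999),
# a million-component cluster is almost never fully healthy:
p = 0.99999
for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9}: P(all working) = {prob_all_working(p, n):.6f}")
```

At a million components, the probability that everything works at once is on the order of 10^-5, which is why the architecture must treat component failure as the steady state rather than an exception.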


Nvidia’s Modern 'GPU' is a Forklift-Sized Rack, Not a Single Chip

The fundamental unit of AI compute has evolved from a silicon chip to a complete, rack-sized system. According to Nvidia's CTO, a single 'GPU' is now an integrated machine that requires a forklift to move, a crucial mindset shift for understanding modern AI infrastructure scale.


Generative AI's Recursive Nature Makes Inference as Compute-Intensive as Training

Unlike simple classification (one pass), generative AI performs recursive inference. Each new token (word, pixel) requires a full pass through the model, turning a single prompt into a series of demanding computations. This makes inference a major, ongoing driver of GPU demand, rivaling training.


Nvidia's Dominance Hinges on Mellanox's Networking, Which Unlocked Data-Center Scale Computing

The exponential growth in AI required moving beyond single GPUs. Mellanox's interconnect technology was critical for scaling to thousands of GPUs, effectively turning the entire data center into a single, high-performance computer and solving the post-Moore's Law scaling challenge.


Nvidia Uses DPUs to Isolate Data Center OS from Applications, Reducing Cyber Attack Surfaces

By running infrastructure tasks on a separate computing platform (the BlueField DPU), Nvidia isolates the data center's operating system from tenant applications on GPUs. This prevents vulnerabilities from crossing over, significantly hardening the system against side-channel attacks and other cyber threats.


Consistent, Low-Jitter Network Latency is More Critical Than Peak Speed for Large AI Clusters

When splitting jobs across thousands of GPUs, inconsistent communication times (jitter) create bottlenecks, forcing the use of fewer GPUs. A network with predictable, uniform latency enables far greater parallelization and overall cluster efficiency, making it more important than raw 'hero number' bandwidth.
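The straggler effect can be simulated directly (a hedged sketch with made-up latency numbers, not measured data): a synchronized collective step finishes only when the slowest message arrives, so step time is the maximum over all GPUs' latencies, and jitter inflates that maximum even when the mean latency is identical.

```python
import random

def sync_step_time(n_gpus: int, mean_us: float, jitter_us: float,
                   rng: random.Random) -> float:
    """Time for one synchronized collective step across n_gpus:
    the step completes only when the slowest GPU's message lands,
    so it is the max over n_gpus latency samples (Gaussian model)."""
    return max(rng.gauss(mean_us, jitter_us) for _ in range(n_gpus))

rng = random.Random(0)
trials = 200
avgs = {}
for jitter in (0.1, 5.0):  # same 10 us mean latency, different jitter
    avgs[jitter] = sum(sync_step_time(10_000, 10.0, jitter, rng)
                       for _ in range(trials)) / trials
    print(f"jitter={jitter:>4} us -> mean step time ~{avgs[jitter]:.1f} us")
```

With 10,000 GPUs the low-jitter network's steps stay close to the 10 us mean, while the high-jitter network's steps take roughly three times as long, despite identical average latency: the tail, not the mean, sets the pace of the whole cluster.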
