Despite using inferior chips due to export restrictions, DeepSeek achieved massive cost savings by discovering and exploiting underdocumented hardware features, such as bypassing a specific cache. This suggests that deep hardware exploration can yield greater gains than simply acquiring more powerful GPUs.
While many focus on compute metrics like FLOPS, the primary bottleneck for large AI models is memory bandwidth—how quickly model weights and the KV cache can be streamed from memory into the compute units. This single metric is a better indicator of real-world performance from one GPU generation to the next than raw compute power.
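A back-of-envelope calculation shows why bandwidth, not FLOPS, bounds decode speed: each generated token requires streaming roughly all model weights from memory once. The sketch below uses illustrative numbers (a 70B-parameter model, ~3 TB/s of HBM bandwidth), not vendor specs, and ignores KV-cache traffic.

```python
# Upper bound on single-stream decode throughput when memory-bandwidth bound:
#     tokens/sec <= memory_bandwidth / bytes_of_weights
# All figures below are illustrative assumptions.

def max_decode_tokens_per_sec(params_billions: float,
                              bytes_per_param: float,
                              hbm_bandwidth_tb_s: float) -> float:
    """Bandwidth-bound ceiling for batch-1 decode, ignoring KV-cache reads."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes = hbm_bandwidth_tb_s * 1e12
    return bandwidth_bytes / weight_bytes

# A 70B model in fp16 (2 bytes/param) on a GPU with ~3 TB/s of HBM bandwidth:
print(round(max_decode_tokens_per_sec(70, 2, 3.0), 1))  # ~21.4 tokens/sec
```

No amount of extra FLOPS raises this ceiling; only more bandwidth, smaller weights (quantization), or batching does.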
Optimizing transformer inference—specifically, separating prefill (building the KV cache from the prompt) from decode (generating tokens one at a time)—is becoming a foundational skill. Chris Fregly predicts this complex topic, known as disaggregated prefill/decode, will be a core component of AI engineering interviews at top labs within two years.
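The two phases have very different profiles, which is what makes disaggregating them attractive: prefill is one compute-bound batched pass, while decode is a memory-bound loop that extends the cache one entry per token. A conceptual sketch (no real attention math; all names are illustrative):

```python
# Toy model of the prefill/decode split. In disaggregated serving these two
# phases can run on separate GPU pools, with the KV cache handed off between
# them; here the "cache" is just a list of placeholder (key, value) pairs.

def prefill(prompt_tokens):
    # One batched pass over the whole prompt: compute-bound (large matmuls).
    return [(f"k{t}", f"v{t}") for t in prompt_tokens]

def decode_step(kv_cache, last_token):
    # One token per step: memory-bound (re-reads weights and the whole cache).
    new_token = last_token + 1          # stand-in for sampling from logits
    kv_cache.append((f"k{new_token}", f"v{new_token}"))
    return new_token

kv = prefill([1, 2, 3])        # prefill node builds the cache...
tok = 3
for _ in range(4):             # ...decode node extends it token by token
    tok = decode_step(kv, tok)
print(len(kv), tok)  # 7 7
```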
Optimizing AI systems on consumer-grade (e.g., RTX) or small-scale professional GPUs is a mistake. The hardware profiles, memory bandwidth, and software components are too different from production systems like Blackwell or Hopper. For performance engineering, the development environment must perfectly mirror the deployment target.
Author Chris Fregly wrote his 1,000-page book on AI systems because NVIDIA's official documentation is severely lacking. He found more practical information from practitioners on social media and forums, highlighting a massive knowledge gap in the official resources provided by the chip leader.
The popular PyTorch Profiler only shows the 'tip of the iceberg.' To achieve meaningful performance gains, engineers must move beyond it and analyze 50-60 low-level GPU metrics related to streaming multiprocessors, instruction pipelines, and specialized function units. Most of the PyTorch community stops too early.
To maintain high velocity with AI coding assistants, Chris Fregly has stopped line-by-line code reviews and traditional unit testing. He now focuses on high-level evaluations and 'correctness harnesses' that continuously run in the background, shifting quality control from process (review) to outcome (performance).
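A correctness harness of this kind can be as simple as a loop that re-runs golden cases against the generated code and reports outcomes rather than failing fast. The function and cases below are hypothetical stand-ins; in practice the harness would run continuously in CI or a background job:

```python
# Minimal sketch of an outcome-based 'correctness harness': instead of
# reviewing AI-generated code line by line, repeatedly check its behavior
# against golden input/output cases.

GOLDEN_CASES = [
    ({"x": 2, "y": 3}, 5),
    ({"x": -1, "y": 1}, 0),
]

def generated_add(x, y):
    return x + y   # stand-in for AI-generated code under evaluation

def run_harness(fn, cases):
    """Run every case and return a full pass/fail report (no early exit)."""
    report = []
    for kwargs, expected in cases:
        got = fn(**kwargs)
        report.append((kwargs, expected, got, got == expected))
    return report

report = run_harness(generated_add, GOLDEN_CASES)
print(all(ok for *_, ok in report))  # True
```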
Newer AI cloud providers gain a performance advantage by building their infrastructure entirely on NVIDIA's integrated ecosystem, including specialized networking. Incumbent clouds often must patch their legacy, CPU-centric systems, creating inefficiencies that 'neo-clouds' without technical debt can avoid.
Borrowing a term from Formula One, Chris Fregly argues that AI engineers must develop a deep, symbiotic understanding of the full hardware-software stack. Rather than just staying at the Python level, true optimizers must co-design algorithms, software, and hardware, just as a champion driver understands how to build their car.
