RiffOn - How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Zyphra's head of model training discusses their all-in bet on AMD, the craft of kernel writing, and disciplined AI-assisted coding.

Combat LLM Context Rot by Periodically Summarizing and Restarting Chats

Long conversations degrade LLM performance as attention gets clogged with irrelevant details. An expert workflow is to stop, ask the model to summarize the key points of the discussion, and then start a fresh chat with that summary as the initial prompt. This keeps the context clean and the model on track.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

Hire for Intellectual Velocity, Not Skills; A Physicist Outshines a Stagnant CUDA Expert

For cutting-edge AI problems, innate curiosity and learning speed ("velocity") are more important than existing domain knowledge. Echoing Karpathy, a candidate with a track record of diving deep into complex topics, regardless of field, will outperform a skilled but less-driven specialist.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

Co-designing LLMs with Target Hardware Unlocks Major Inference Efficiency Gains

Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects target hardware and then chooses model parameters—such as a hidden dimension with many powers of two—to align with how GPUs split up workloads, maximizing efficiency from day one.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

Focused, Funded Teams Outperform Large Consortia in Open-Source AI

The key to successful open-source AI isn't uniting everyone into a massive project. Instead, EleutherAI's model proves more effective: creating small, siloed teams with guaranteed compute and end-to-end funding for a single, specific research problem. This avoids organizational overhead and ensures completion.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

AI Teams Win by Optimizing for Today's GPUs, Not Waiting for Tomorrow's

Top-tier kernels like FlashAttention are co-designed with specific hardware (e.g., H100). This tight coupling makes waiting for future GPUs an impractical strategy. The competitive edge comes from maximizing the performance of available hardware now, even if it means rewriting kernels for each new generation.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

LLMs Fail at Low-Level GPU Programming Due to Scarce Data and Debugging Complexity

AI coding assistants struggle with deep kernel work (CUDA, PTX) because there's little public code to learn from. Furthermore, debugging AI-generated parallel code is extremely difficult because the developer lacks the original mental model, making it less efficient than writing it themselves.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

AMD's MI300X GPUs Outperform NVIDIA H100 on Memory-Intensive LLM Training

The MI300X's superior memory bandwidth and 192GB VRAM make it faster than H100s for non-FP8 dense transformers or MoE models. Quentin Anthony from Zyphra notes AMD's software has caught up, creating an under-appreciated arbitrage opportunity for teams willing to build on their stack.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

Peak GPU Performance Comes From Bottom-Up Kernel Design, Not Top-Down Compilers

Instead of using high-level compilers like Triton, elite programmers design algorithms based on specific hardware properties (e.g., AMD's MI300X). This bottom-up approach ensures the code fully exploits the hardware's strengths, a level of control often lost through abstractions like Triton.

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Latent Space: The AI Engineer Podcast·4 months ago

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Combat LLM Context Rot by Periodically Summarizing and Restarting Chats

Hire for Intellectual Velocity, Not Skills; A Physicist Outshines a Stagnant CUDA Expert

Co-designing LLMs with Target Hardware Unlocks Major Inference Efficiency Gains

Top AI-Powered Engineers Deconstruct Problems and Apply LLMs Selectively

Focused, Funded Teams Outperform Large Consortia in Open-Source AI

AI Teams Win by Optimizing for Today's GPUs, Not Waiting for Tomorrow's

LLMs Fail at Low-Level GPU Programming Due to Scarce Data and Debugging Complexity

AMD's MI300X GPUs Outperform NVIDIA H100 on Memory-Intensive LLM Training

Peak GPU Performance Comes From Bottom-Up Kernel Design, Not Top-Down Compilers

How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Combat LLM Context Rot by Periodically Summarizing and Restarting Chats

Hire for Intellectual Velocity, Not Skills; A Physicist Outshines a Stagnant CUDA Expert

Co-designing LLMs with Target Hardware Unlocks Major Inference Efficiency Gains

Top AI-Powered Engineers Deconstruct Problems and Apply LLMs Selectively

Focused, Funded Teams Outperform Large Consortia in Open-Source AI

AI Teams Win by Optimizing for Today's GPUs, Not Waiting for Tomorrow's

LLMs Fail at Low-Level GPU Programming Due to Scarce Data and Debugging Complexity

AMD's MI300X GPUs Outperform NVIDIA H100 on Memory-Intensive LLM Training

Peak GPU Performance Comes From Bottom-Up Kernel Design, Not Top-Down Compilers