Tools like OpenAI's Codex can complete hours of coding in minutes once a design phase is done. This leaves the developer with awkward stretches of downtime, fundamentally altering the daily work rhythm from a steady flow to bursts of intense work followed by waiting.
Focusing on which reinforcement learning algorithm is best (e.g., PPO vs. DPO) is misguided. The more critical factor is the quality and verifiability of the input data signal itself, which exists on a spectrum from subjective human preference (RLHF) to objective, verifiable rewards (RLVR).
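As a rough illustration of that spectrum, the sketch below contrasts a learned preference score with a test-based verifiable reward. The `reward_model` interface, helper names, and pass/fail scoring are assumptions for illustration, not OpenAI's pipeline.

```python
# Illustrative only: two reward signals at opposite ends of the spectrum,
# from a subjective learned preference score (RLHF) to an objective,
# verifiable check (RLVR). Names and scoring are hypothetical.
import subprocess
import sys
import tempfile


def preference_reward(prompt: str, response: str, reward_model) -> float:
    """Subjective end: a reward model trained on human comparisons scores
    the response; 'reward_model.score' is a hypothetical interface."""
    return reward_model.score(prompt, response)


def verifiable_reward(candidate_code: str, unit_tests: str, timeout_s: int = 30) -> float:
    """Verifiable end: run the model's code against ground-truth tests and
    grade pass/fail, so the signal cannot be gamed by sounding plausible."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```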
Unlike pre-training's simpler data pipeline, RL involves many "moving parts" because each task can have a unique grading setup and infrastructure. This complexity, not just the algorithm itself, is the primary challenge for researchers managing live training runs at scale.
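One way to picture those moving parts is that every task family ships with its own grader and setup code, not just its own data. A minimal sketch, assuming a hypothetical `Grader` protocol with stubbed sandbox and judge calls:

```python
# Hypothetical sketch of per-task grading infrastructure in an RL run.
# The protocol, task names, and stubs below are illustrative, not OpenAI's code.
from typing import Protocol


class Grader(Protocol):
    def setup(self, task_spec: dict) -> None: ...
    def score(self, rollout: str) -> float: ...


class UnitTestGrader:
    """Coding tasks: execute the rollout against tests in a sandbox (stubbed)."""

    def setup(self, task_spec: dict) -> None:
        self.tests = task_spec["tests"]

    def score(self, rollout: str) -> float:
        raise NotImplementedError("would run rollout + self.tests in a sandbox")


class RubricGrader:
    """Open-ended tasks: score the rollout against a rubric via a judge model (stubbed)."""

    def setup(self, task_spec: dict) -> None:
        self.rubric = task_spec["rubric"]

    def score(self, rollout: str) -> float:
        raise NotImplementedError("would call a judge model with self.rubric")


# Each task family plugs its own grader (and its own infrastructure) into the
# same training loop; keeping all of these healthy is much of the operational work.
GRADERS: dict[str, Grader] = {"coding": UnitTestGrader(), "writing": RubricGrader()}
```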
Advanced models are moving beyond simple prompt-response cycles. New interfaces, such as the one in OpenAI's shopping model, allow users to interrupt the model's reasoning process (its "chain of thought") to provide real-time corrections, representing a powerful new way for humans to collaborate with AI agents.
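A hedged sketch of what such an interruptible loop could look like; the `model` interface and the queue-based correction channel below are assumptions, not the actual product implementation.

```python
# Illustrative only: an agent loop that checks for user corrections between
# reasoning steps so they can steer the remaining chain of thought.
import queue


def run_interruptible_agent(model, task: str, corrections: "queue.Queue[str]") -> str:
    reasoning: list[str] = []
    while not model.is_done(task, reasoning):          # hypothetical stopping check
        try:
            note = corrections.get_nowait()            # user interjects mid-reasoning
            reasoning.append(f"[user correction] {note}")
        except queue.Empty:
            pass
        # Extend the chain of thought one step, conditioned on any corrections so far.
        reasoning.append(model.next_reasoning_step(task, reasoning))
    return model.final_answer(task, reasoning)
```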
Top AI labs struggle to find people skilled in both ML research and systems engineering. Progress is often bottlenecked by one or the other, requiring individuals who can seamlessly switch between optimizing algorithms and building the underlying infrastructure, a hybrid skillset rarely taught in academia.
Progress in complex, long-running agentic tasks is better measured in tokens consumed than in raw wall-clock time. Improving token efficiency, as seen from GPT-5 to 5.1, directly enables more tool calls and actions within a feasible operational budget, unlocking greater capabilities.
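The arithmetic behind that point is simple: for a fixed token budget, the number of reasoning-plus-tool-call steps an agent can afford scales inversely with the tokens spent per step. The numbers below are illustrative, not measured GPT-5 or 5.1 figures.

```python
# Back-of-the-envelope sketch: fewer tokens per step buys more actions
# within the same operational budget. All figures are made up for illustration.
def affordable_steps(token_budget: int, tokens_per_step: int) -> int:
    return token_budget // tokens_per_step


BUDGET = 200_000                                 # hypothetical per-task budget
print(affordable_steps(BUDGET, 4_000))           # less efficient model  -> 50 steps
print(affordable_steps(BUDGET, 2_500))           # more efficient model  -> 80 steps
```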
![[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI](https://assets.flightcast.com/V2Uploads/nvaja2542wefzb8rjg5f519m/01K4D8FB4MNA071BM5ZDSMH34N/square.jpg)