Tools like OpenAI's Codex can complete hours of coding in minutes once a design phase is done. This leaves the developer with awkward stretches of downtime, fundamentally altering the daily work rhythm from a steady flow to bursts of intense work followed by waiting.
Focusing on which reinforcement learning algorithm is best (e.g., PPO vs. DPO) is misguided. The more critical factor is the quality and verifiability of the input data signal itself, which exists on a spectrum from subjective human preference (RLHF) to objective, verifiable rewards (RLVR).
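As a rough illustration of that spectrum, the sketch below contrasts a learned preference score with a test-based verifiable reward. The `reward_model` interface, helper names, and pass/fail scoring are assumptions for illustration, not OpenAI's pipeline.

```python
# Illustrative only: two reward signals at opposite ends of the spectrum,
# from a subjective learned preference score (RLHF) to an objective,
# verifiable check (RLVR). Names and scoring are hypothetical.
import subprocess
import sys
import tempfile


def preference_reward(prompt: str, response: str, reward_model) -> float:
    """Subjective end: a reward model trained on human comparisons scores
    the response; 'reward_model.score' is a hypothetical interface."""
    return reward_model.score(prompt, response)


def verifiable_reward(candidate_code: str, unit_tests: str, timeout_s: int = 30) -> float:
    """Verifiable end: run the model's code against ground-truth tests and
    grade pass/fail, so the signal cannot be gamed by sounding plausible."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```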
Unlike pre-training's simpler data pipeline, RL involves many "moving parts" because each task can have a unique grading setup and infrastructure. This complexity, not just the algorithm itself, is the primary challenge for researchers managing live training runs at scale.
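One way to picture those moving parts is that every task family ships with its own grader and setup code, not just its own data. A minimal sketch, assuming a hypothetical `Grader` protocol with stubbed sandbox and judge calls:

```python
# Hypothetical sketch of per-task grading infrastructure in an RL run.
# The protocol, task names, and stubs below are illustrative, not OpenAI's code.
from typing import Protocol


class Grader(Protocol):
    def setup(self, task_spec: dict) -> None: ...
    def score(self, rollout: str) -> float: ...


class UnitTestGrader:
    """Coding tasks: execute the rollout against tests in a sandbox (stubbed)."""

    def setup(self, task_spec: dict) -> None:
        self.tests = task_spec["tests"]

    def score(self, rollout: str) -> float:
        raise NotImplementedError("would run rollout + self.tests in a sandbox")


class RubricGrader:
    """Open-ended tasks: score the rollout against a rubric via a judge model (stubbed)."""

    def setup(self, task_spec: dict) -> None:
        self.rubric = task_spec["rubric"]

    def score(self, rollout: str) -> float:
        raise NotImplementedError("would call a judge model with self.rubric")


# Each task family plugs its own grader (and its own infrastructure) into the
# same training loop; keeping all of these healthy is much of the operational work.
GRADERS: dict[str, Grader] = {"coding": UnitTestGrader(), "writing": RubricGrader()}
```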
Advanced models are moving beyond simple prompt-response cycles. New interfaces, such as the one in OpenAI's shopping model, allow users to interrupt the model's reasoning process (its "chain of thought") to provide real-time corrections, representing a powerful new way for humans to collaborate with AI agents.
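A hedged sketch of what such an interruptible loop could look like; the `model` interface and the queue-based correction channel below are assumptions, not the actual product implementation.

```python
# Illustrative only: an agent loop that checks for user corrections between
# reasoning steps so they can steer the remaining chain of thought.
import queue


def run_interruptible_agent(model, task: str, corrections: "queue.Queue[str]") -> str:
    reasoning: list[str] = []
    while not model.is_done(task, reasoning):          # hypothetical stopping check
        try:
            note = corrections.get_nowait()            # user interjects mid-reasoning
            reasoning.append(f"[user correction] {note}")
        except queue.Empty:
            pass
        # Extend the chain of thought one step, conditioned on any corrections so far.
        reasoning.append(model.next_reasoning_step(task, reasoning))
    return model.final_answer(task, reasoning)
```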
Top AI labs struggle to find people skilled in both ML research and systems engineering. Progress is often bottlenecked by one or the other, requiring individuals who can seamlessly switch between optimizing algorithms and building the underlying infrastructure, a hybrid skillset rarely taught in academia.
Progress in complex, long-running agentic tasks is better measured in tokens consumed than in raw wall-clock time. Improving token efficiency, as seen from GPT-5 to 5.1, directly enables more tool calls and actions within a feasible operational budget, unlocking greater capabilities.
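The arithmetic behind that point is simple: for a fixed token budget, the number of reasoning-plus-tool-call steps an agent can afford scales inversely with the tokens spent per step. The numbers below are illustrative, not measured GPT-5 or 5.1 figures.

```python
# Back-of-the-envelope sketch: fewer tokens per step buys more actions
# within the same operational budget. All figures are made up for illustration.
def affordable_steps(token_budget: int, tokens_per_step: int) -> int:
    return token_budget // tokens_per_step


BUDGET = 200_000                                 # hypothetical per-task budget
print(affordable_steps(BUDGET, 4_000))           # less efficient model  -> 50 steps
print(affordable_steps(BUDGET, 2_500))           # more efficient model  -> 80 steps
```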
![[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI](https://assets.flightcast.com/V2Uploads/nvaja2542wefzb8rjg5f519m/01K4D8FB4MNA071BM5ZDSMH34N/square.jpg)