The GRPO algorithm wasn't a huge theoretical leap over its predecessors. It became famous because DeepSeek did the significant engineering work to scale it and, crucially, released a high-performing model that proved the method's practical viability.
Algorithms like GRPO are powerful but require parallel rollouts in a reproducible environment. Building and maintaining these high-fidelity sandboxes, complete with realistic data and failure modes, is the hardest part of implementing RL today and a significant barrier for most companies.
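To make that requirement concrete, here is a self-contained toy sketch (not any real library's API): the "environment" is deterministic given a fixed seed, so every parallel worker faces the identical task and only the policy's sampling varies. That reproducibility is what makes the rewards comparable within a GRPO group.

```python
# Toy sketch of a GRPO-style rollout group; SandboxEnv-style realism is
# faked with a seeded RNG so the example stays self-contained and runnable.
import random
from concurrent.futures import ProcessPoolExecutor

GROUP_SIZE = 8   # GRPO compares rollouts within a group on the same task
ENV_SEED = 42    # fixed seed: every rollout sees identical data and failures

def rollout(sample_seed: int) -> float:
    task_rng = random.Random(ENV_SEED)       # reproducible task setup
    target = task_rng.randint(0, 100)        # stands in for realistic data
    policy_rng = random.Random(sample_seed)  # only the policy's sampling varies
    guess = policy_rng.randint(0, 100)       # stands in for an agent trajectory
    return -abs(target - guess)              # task-specific reward

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        rewards = list(pool.map(rollout, range(GROUP_SIZE)))
    print(rewards)  # one reward per rollout, ready for group-relative scoring
```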
A key surprise in AI development was the non-linear impact of scale. Sebastian Thrun noted that while AI trained on millions of documents is 'fine,' training it on hundreds of billions creates an 'unbelievably smart' system, shocking even its creators and demonstrating data volume as a primary driver of breakthroughs.
Episodes in AI history, such as the 2012 AlexNet breakthrough, demonstrate that scaling compute and data on simpler, older algorithms often yields greater advances than designing intricate new ones. This "bitter lesson" suggests prioritizing scalability over algorithmic complexity for future progress.
The "bitter lesson" in AI research posits that methods leveraging massive computation scale better and ultimately win out over approaches that rely on human-designed domain knowledge or clever shortcuts, favoring scale over ingenuity.
When OpenAI started, the AI research community measured progress via peer-reviewed papers. OpenAI's contrarian move was to pour millions into GPUs and large-scale engineering aimed at tangible results, a strategy criticized by academics but which ultimately led to their breakthrough.
Contrary to the "bitter lesson" narrative that scale is all that matters, novel ideas remain a critical driver of AI progress. The field is not yet experiencing diminishing returns on new concepts; game-changing ideas are still being invented and are essential for making scaling effective in the first place.
Much of the RL research published between 2015 and 2022 has not proven useful in practice because academia rewards complex, math-heavy ideas, which offer implicit "knobs" for overfitting benchmarks, while simpler, more generalizable approaches are passed over for lacking intellectual novelty.
OpenPipe's 'Ruler' library leverages a key insight: GRPO only needs relative rankings, not absolute scores. By having an LLM judge stack-rank a group of agent runs, one can generate effective rewards. This approach works phenomenally well, even with weaker judge models, effectively solving the reward assignment problem.
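A minimal sketch of that insight, assuming nothing about Ruler's actual API: turn a judge's stack-ranking into scores, then apply GRPO-style group normalization. Because the normalization is invariant to shifting and scaling, any linear scoring of the ranks yields identical advantages; only the relative order the judge produces matters.

```python
from statistics import mean, pstdev

def scores_from_ranking(ranked_ids: list[int]) -> list[float]:
    """Best-to-worst ranking -> per-run score (best gets n-1, worst gets 0)."""
    n = len(ranked_ids)
    scores = [0.0] * n
    for place, run_id in enumerate(ranked_ids):
        scores[run_id] = float(n - 1 - place)
    return scores

def group_advantages(scores: list[float]) -> list[float]:
    """GRPO-style normalization: subtract the group mean, divide by its std."""
    mu, sigma = mean(scores), pstdev(scores) or 1.0
    return [(s - mu) / sigma for s in scores]

# Hypothetical judge output: of 4 agent runs, run 2 is best, run 1 is worst.
advantages = group_advantages(scores_from_ranking([2, 0, 3, 1]))
print(advantages)  # runs above the group mean get positive advantages
```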
Instead of pinpointing which specific action led to a good outcome, the GRPO algorithm sidesteps the credit assignment problem with a simple heuristic: it assumes the rare (low-probability) tokens in a high-scoring output were responsible for the success and upweights them all. This "unsatisfying" but practical approach works surprisingly well.
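A simplified sketch of that heuristic, assuming a toy trainer rather than any production implementation: every token in an output inherits that output's group-relative advantage, and the rare-token effect falls out of the policy gradient, since scaling the log-probability term pushes hardest on the tokens the model found least likely.

```python
import torch

def grpo_token_advantages(rewards: torch.Tensor, lengths: list[int]) -> list[torch.Tensor]:
    """Broadcast each output's group-relative advantage over all of its tokens."""
    # No per-token credit assignment: one shared advantage per whole sequence.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return [a.expand(n) for a, n in zip(adv, lengths)]

# Four sampled outputs for one prompt, scored pass/fail, with token lengths.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
per_token = grpo_token_advantages(rewards, lengths=[5, 3, 4, 6])
# In the loss, each token's log-probability is scaled by this shared weight:
#   loss = -(advantage * log_prob).sum()
# so every token of a winning output is upweighted, and low-probability
# ("rare") tokens move the most because their log-prob gradients are largest.
```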
The Dota team expected their simple PPO algorithm to fail, hoping it would force innovation. Instead, they found that massive compute applied to a supposedly "flawed" algorithm could achieve superhuman results. This became a foundational insight for OpenAI's scaling-first strategy.