
Instead of pinpointing which specific action led to a good outcome, the GRPO algorithm solves the credit assignment problem with a blunt heuristic: it assumes every token in a high-scoring output contributed to the result and upweights all of them uniformly. This "unsatisfying" but practical approach works surprisingly well.
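
A minimal sketch of that heuristic, assuming binary 0/1 outcome rewards and a group of sampled completions per prompt (the function name and reward values are illustrative, not any particular library's API):

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantage: each completion is scored against the
    mean (and std) of the other samples drawn for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled completions for one prompt, graded 0/1 on the final answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)  # [ 1., -1., -1.,  1.]

# The crude credit assignment: every token in a completion inherits that
# completion's single advantage value, whether or not it actually helped.
token_counts = [12, 30, 8, 17]
per_token = [np.full(n, a) for n, a in zip(token_counts, advantages)]
```

Every token in a winning sample gets the same positive weight, which is exactly the coarseness the "unsatisfying" label refers to.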

Related Insights

Reinforcement learning achieves superhuman results not by inventing alien concepts, but by surfacing and combining rare behaviors that are already possible within a model's vast pre-trained distribution. The goal of pre-training is to make this search for novel solutions more efficient and less random.

Modern LLMs use a simple form of reinforcement learning that directly rewards successful outcomes. This contrasts with more sophisticated methods, like those in AlphaGo or the brain, which use "value functions" to estimate long-term consequences. It's a mystery why the simpler approach is so effective.
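
One way to see the difference, as a hedged sketch (the helper names and discount factor are illustrative): outcome-only RL broadcasts a single end-of-episode reward to every step, while a value function converts it into a per-step temporal-difference signal.

```python
import numpy as np

def outcome_only_signal(final_reward: float, n_steps: int) -> np.ndarray:
    """The simple recipe: one scalar at the end is the signal for every step."""
    return np.full(n_steps, final_reward)

def td_errors(rewards: np.ndarray, values: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """The AlphaGo-style recipe: a learned V(s) yields a per-step signal,
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V(terminal) = 0."""
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values
```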

Karpathy criticizes standard reinforcement learning as a noisy and inefficient process. It assigns credit or blame to an entire sequence of actions based on a single outcome bit (success/failure). This is like "sucking supervision through a straw," as it fails to identify which specific steps in a successful trajectory were actually correct.

Much RL research from 2015-2022 has not proven useful in practice because academia rewards complex, math-heavy ideas: their many implicit "knobs" make it easy to overfit benchmarks. Simpler, more generalizable approaches get ignored because they lack intellectual novelty.

The distinction between imitation learning and reinforcement learning (RL) is not a rigid dichotomy. Next-token prediction in LLMs can be framed as a form of RL where the "episode" is just one token long and the reward is based on prediction accuracy. This conceptual model places both learning paradigms on a continuous spectrum rather than in separate categories.
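
This framing can be made concrete. In the sketch below (PyTorch, with an illustrative vocabulary size), a one-token REINFORCE episode whose action is the observed next token and whose reward is 1 produces exactly the standard cross-entropy loss:

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000                                       # illustrative
logits = torch.randn(1, vocab_size, requires_grad=True)   # one position's logits
target = torch.tensor([42])                               # the observed next token

# Supervised view: standard next-token cross-entropy.
ce_loss = F.cross_entropy(logits, target)

# RL view: a one-token episode. The action is the observed token,
# the reward is 1, and the REINFORCE loss is -reward * log pi(action).
log_probs = F.log_softmax(logits, dim=-1)
reward = 1.0
rl_loss = -reward * log_probs[0, target]

assert torch.allclose(ce_loss, rl_loss)  # same loss, hence same gradients
```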

When determining what data an RL model should consider, resist including every available feature. Instead, observe how experienced human decision-makers reason about the problem. Their simplified mental models reveal the core signals that truly drive outcomes, leading to more stable, faster-learning, and more interpretable AI systems.

On-policy reinforcement learning, where a model learns from its own generated actions and their consequences, is analogous to how humans learn from direct experience and mistakes. This contrasts with off-policy methods like supervised fine-tuning (SFT), which resemble simply imitating others' successful paths.
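
A toy contrast, assuming a three-action policy and a hypothetical reward function (this is a sketch of the two learning modes, not any particular library's training loop):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.zeros(3, requires_grad=True)  # tiny 3-action policy
opt = torch.optim.SGD([logits], lr=0.5)

EXPERT_ACTION = 0
def reward_of(action: int) -> float:
    """Hypothetical environment: action 0 succeeds, everything else fails."""
    return 1.0 if action == EXPERT_ACTION else 0.0

# Off-policy, SFT-like: imitate the expert's recorded action directly.
for _ in range(20):
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([EXPERT_ACTION]))
    opt.zero_grad()
    loss.backward()
    opt.step()

# On-policy, REINFORCE-like: act, get scored on your own attempt, update.
for _ in range(20):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()  # the model's own action
    loss = -reward_of(action.item()) * dist.log_prob(action)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Both loops end up favoring action 0, but the second learns it only from the consequences of its own samples, mistakes included.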

OpenPipe's 'Ruler' library leverages a key insight: GRPO only needs relative rankings, not absolute scores. By having an LLM judge stack-rank a group of agent runs, one can generate effective rewards. This approach works phenomenally well, even with weaker judge models, effectively sidestepping the reward-engineering problem.
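
Ruler's actual implementation isn't shown here; the sketch below just illustrates the underlying idea, with an assumed judge output format: map a stack-ranking (best first) onto evenly spaced rewards, then let GRPO's group normalization do the rest.

```python
import numpy as np

def ranks_to_rewards(ranking: list[int]) -> np.ndarray:
    """Turn a judge's stack-ranking (run indices, best first) into scalars.
    Evenly spaced scores: best run -> 1.0, worst run -> 0.0."""
    n = len(ranking)
    rewards = np.empty(n)
    for position, run_idx in enumerate(ranking):
        rewards[run_idx] = (n - 1 - position) / (n - 1)
    return rewards

# Suppose an LLM judge ranks four agent runs: run 2 best, then 0, 3, 1.
rewards = ranks_to_rewards([2, 0, 3, 1])  # run-indexed: [0.67, 0.0, 1.0, 0.33]

# GRPO never sees the judge's (arbitrary) scale, only the group-normalized
# advantages, so the ranking is all that matters.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```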

In the endgame, AlphaGo made moves that seemed suboptimal, even giving up points. This was because it wasn't optimizing for a large victory margin (a human heuristic) but purely for maximizing the probability of winning, even by a half-point. This reveals how literal AI objective functions can differ from human proxies for success.

The GRPO algorithm wasn't a huge theoretical leap over its predecessors. It became famous because DeepSeek did the significant engineering work to scale it and, crucially, released a high-performing model that proved the method's practical viability.