
Instead of pinpointing which specific action led to a good outcome, the GRPO algorithm solves the credit assignment problem with a blunt heuristic: it assumes every token in a high-scoring output contributed to the result and upweights all of them uniformly. This "unsatisfying" but practical approach works surprisingly well.
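
A minimal sketch of that heuristic, assuming binary 0/1 outcome rewards and a group of sampled completions per prompt (the function name and reward values are illustrative, not any particular library's API):

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantage: each completion is scored against the
    mean (and std) of the other samples drawn for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled completions for one prompt, graded 0/1 on the final answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)  # [ 1., -1., -1.,  1.]

# The crude credit assignment: every token in a completion inherits that
# completion's single advantage value, whether or not it actually helped.
token_counts = [12, 30, 8, 17]
per_token = [np.full(n, a) for n, a in zip(token_counts, advantages)]
```

Every token in a winning sample gets the same positive weight, which is exactly the coarseness the "unsatisfying" label refers to.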

Related Insights

Reinforcement learning achieves superhuman results not by inventing alien concepts, but by surfacing and combining rare behaviors that are already possible within a model's vast pre-trained distribution. The goal of pre-training is to make this search for novel solutions more efficient and less random.

Modern LLMs use a simple form of reinforcement learning that directly rewards successful outcomes. This contrasts with more sophisticated methods, like those in AlphaGo or the brain, which use "value functions" to estimate long-term consequences. It's a mystery why the simpler approach is so effective.
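
One way to see the difference, as a hedged sketch (the helper names and discount factor are illustrative): outcome-only RL broadcasts a single end-of-episode reward to every step, while a value function converts it into a per-step temporal-difference signal.

```python
import numpy as np

def outcome_only_signal(final_reward: float, n_steps: int) -> np.ndarray:
    """The simple recipe: one scalar at the end is the signal for every step."""
    return np.full(n_steps, final_reward)

def td_errors(rewards: np.ndarray, values: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """The AlphaGo-style recipe: a learned V(s) yields a per-step signal,
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V(terminal) = 0."""
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values
```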

Karpathy criticizes standard reinforcement learning as a noisy and inefficient process. It assigns credit or blame to an entire sequence of actions based on a single outcome bit (success/failure). This is like "sucking supervision through a straw," as it fails to identify which specific steps in a successful trajectory were actually correct.

Much RL research from 2015-2022 has not proven useful in practice because academia rewards complex, math-heavy ideas: their many implicit "knobs" make it easy to overfit benchmarks. Simpler, more generalizable approaches get ignored because they lack intellectual novelty.

The distinction between imitation learning and reinforcement learning (RL) is not a rigid dichotomy. Next-token prediction in LLMs can be framed as a form of RL where the "episode" is just one token long and the reward is based on prediction accuracy. This conceptual model places both learning paradigms on a continuous spectrum rather than in separate categories.
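
This framing can be made concrete. In the sketch below (PyTorch, with an illustrative vocabulary size), a one-token REINFORCE episode whose action is the observed next token and whose reward is 1 produces exactly the standard cross-entropy loss:

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000                                       # illustrative
logits = torch.randn(1, vocab_size, requires_grad=True)   # one position's logits
target = torch.tensor([42])                               # the observed next token

# Supervised view: standard next-token cross-entropy.
ce_loss = F.cross_entropy(logits, target)

# RL view: a one-token episode. The action is the observed token,
# the reward is 1, and the REINFORCE loss is -reward * log pi(action).
log_probs = F.log_softmax(logits, dim=-1)
reward = 1.0
rl_loss = -reward * log_probs[0, target]

assert torch.allclose(ce_loss, rl_loss)  # same loss, hence same gradients
```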

When determining what data an RL model should consider, resist including every available feature. Instead, observe how experienced human decision-makers reason about the problem. Their simplified mental models reveal the core signals that truly drive outcomes, leading to more stable, faster-learning, and more interpretable AI systems.

On-policy reinforcement learning, where a model learns from its own generated actions and their consequences, is analogous to how humans learn from direct experience and mistakes. This contrasts with off-policy methods like supervised fine-tuning (SFT), which resemble simply imitating others' successful paths.
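
A toy contrast, assuming a three-action policy and a hypothetical reward function (this is a sketch of the two learning modes, not any particular library's training loop):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.zeros(3, requires_grad=True)  # tiny 3-action policy
opt = torch.optim.SGD([logits], lr=0.5)

EXPERT_ACTION = 0
def reward_of(action: int) -> float:
    """Hypothetical environment: action 0 succeeds, everything else fails."""
    return 1.0 if action == EXPERT_ACTION else 0.0

# Off-policy, SFT-like: imitate the expert's recorded action directly.
for _ in range(20):
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([EXPERT_ACTION]))
    opt.zero_grad()
    loss.backward()
    opt.step()

# On-policy, REINFORCE-like: act, get scored on your own attempt, update.
for _ in range(20):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()  # the model's own action
    loss = -reward_of(action.item()) * dist.log_prob(action)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Both loops end up favoring action 0, but the second learns it only from the consequences of its own samples, mistakes included.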

OpenPipe's 'Ruler' library leverages a key insight: GRPO only needs relative rankings, not absolute scores. By having an LLM judge stack-rank a group of agent runs, one can generate effective rewards. This approach works phenomenally well, even with weaker judge models, effectively sidestepping the reward-engineering problem.
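
Ruler's actual implementation isn't shown here; the sketch below just illustrates the underlying idea, with an assumed judge output format: map a stack-ranking (best first) onto evenly spaced rewards, then let GRPO's group normalization do the rest.

```python
import numpy as np

def ranks_to_rewards(ranking: list[int]) -> np.ndarray:
    """Turn a judge's stack-ranking (run indices, best first) into scalars.
    Evenly spaced scores: best run -> 1.0, worst run -> 0.0."""
    n = len(ranking)
    rewards = np.empty(n)
    for position, run_idx in enumerate(ranking):
        rewards[run_idx] = (n - 1 - position) / (n - 1)
    return rewards

# Suppose an LLM judge ranks four agent runs: run 2 best, then 0, 3, 1.
rewards = ranks_to_rewards([2, 0, 3, 1])  # run-indexed: [0.67, 0.0, 1.0, 0.33]

# GRPO never sees the judge's (arbitrary) scale, only the group-normalized
# advantages, so the ranking is all that matters.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```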

In the endgame, AlphaGo made moves that seemed suboptimal, even giving up points. This was because it wasn't optimizing for a large victory margin (a human heuristic) but purely for maximizing the probability of winning, even by a half-point. This reveals how literal AI objective functions can differ from human proxies for success.

The GRPO algorithm wasn't a huge theoretical leap over its predecessors. It became famous because DeepSeek did the significant engineering work to scale it and, crucially, released a high-performing model that proved the method's practical viability.