Unlike typical reinforcement learning, which learns from sparse win/loss signals, AlphaGo's method is remarkably stable. It uses MCTS to generate an 'improved' move distribution for every state, turning the problem into a simple supervised learning task: imitate a better version of yourself. This sidesteps the high-variance gradients of vanilla policy-gradient methods.
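
A minimal sketch of that framing in Python (all names and numbers are illustrative; a toy `search_target` stands in for a real multi-simulation MCTS): the search output becomes a dense probability target, and the update is an ordinary cross-entropy step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=5)        # "policy network" output for one state
move_quality = rng.normal(size=5)  # hidden move values a real search would uncover

def search_target(logits, quality, strength=2.0):
    # Stand-in for MCTS: sharpen the prior toward better moves, analogous
    # to the normalized visit counts of a real search.
    return softmax(logits + strength * quality)

target = search_target(logits, move_quality)    # the 'improved' move distribution
probs = softmax(logits)
loss = -np.sum(target * np.log(probs + 1e-12))  # plain cross-entropy
grad = probs - target                           # dense, low-variance gradient
logits -= 0.1 * grad                            # one ordinary supervised step
print(round(loss, 3))
```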

Related Insights

Go's search space is larger than the number of atoms in the universe, making exhaustive search impossible. AlphaGo's core breakthrough was using neural networks to intelligently guide its search, evaluating only the most promising moves and making an intractable problem solvable.
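
A back-of-envelope illustration of why guidance matters (the depth and top-k values here are illustrative, not AlphaGo's actual settings):

```python
# Go allows up to 361 moves per ply. An unguided lookahead explodes;
# keeping only the policy network's most promising moves per ply
# collapses the tree to something searchable.
depth = 8
full = 361 ** depth   # unguided tree: ~2.9e20 leaf positions
guided = 5 ** depth   # network keeps top-5 moves per ply: 390,625
print(f"unguided: {full:.2e} leaves vs guided: {guided:,} leaves")
```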

AlphaGo's architecture mimicked human cognition by pairing a 'fast thinking' neural network for intuition with a 'slow thinking' search algorithm for explicit planning. This hybrid model, combining pattern recognition with calculation, proved more powerful for tackling complex problems than either approach alone.
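
A toy sketch of the hybrid, assuming a made-up one-dimensional game: a fast heuristic proposes a few candidate moves ('intuition'), and a slow explicit search plans only over those candidates.

```python
GOAL = 10
MOVES = [-3, -1, 1, 2, 5]  # toy action set

def value(state):
    # 'Fast thinking' evaluation: an instant, approximate judgment.
    return -abs(GOAL - state)

def fast_policy(state, k=3):
    # 'Fast thinking' intuition: rank moves by one-step value, keep top k.
    return sorted(MOVES, key=lambda m: value(state + m), reverse=True)[:k]

def slow_search(state, depth=3):
    # 'Slow thinking' planning: explicit lookahead, but only over the
    # moves intuition proposed, so the tree stays small.
    if depth == 0:
        return value(state), None
    best = (float("-inf"), None)
    for m in fast_policy(state):
        v, _ = slow_search(state + m, depth - 1)
        best = max(best, (v, m))
    return best

print(slow_search(0))  # planning finds lines that intuition alone would miss
```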

In domains like coding and math, where correctness is automatically verifiable, AI can move beyond imitating human preferences (as in RLHF). Using pure reinforcement learning, or "experiential learning," models learn via self-play and trial and error, and can discover novel, superhuman strategies similar to AlphaGo's Move 37.
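
A sketch of what makes such domains special, with invented candidate programs and tests: the reward is computed by execution, so no human judgment enters the loop.

```python
# Candidate solutions to "sort a list", scored by running hidden tests.
# The reward is automatic and can't be gamed by merely looking plausible.
candidates = {
    "plausible-but-wrong": lambda xs: xs,
    "actually-correct":    lambda xs: sorted(xs),
}

tests = [([3, 1, 2], [1, 2, 3]), ([], []), ([5, 5, 1], [1, 5, 5])]

def reward(program):
    # Binary, automatically checkable outcome: the kind of signal that
    # lets pure RL climb beyond human demonstrations.
    return float(all(program(x) == y for x, y in tests))

for name, prog in candidates.items():
    print(name, reward(prog))
```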

MCTS acts like the DAgger (Dataset Aggregation) algorithm from imitation learning in robotics. For every state in a game, even one on a losing path, MCTS provides a 'better' action. This teaches the policy not just the optimal path but also how to recover and get back to it from suboptimal states, creating a more robust agent.
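
A toy DAgger-style loop in the same spirit (a one-dimensional corridor with a hand-written 'expert' standing in for MCTS): the learner rolls out its own, possibly bad, policy, and every state it actually visits gets a corrective label.

```python
import random

random.seed(0)
GOAL, N = 9, 10  # 1-D corridor of 10 cells; reach the rightmost cell

def expert(state):
    # Stand-in for MCTS: the corrective action from ANY state,
    # including states on a losing path.
    return +1 if state < GOAL else 0

policy = {s: random.choice([-1, +1]) for s in range(N)}  # bad initial policy
dataset = []

for _ in range(5):                    # DAgger-style iterations
    s = 0
    for _ in range(N):                # roll out the LEARNER's own policy
        dataset.append((s, expert(s)))  # label every visited state
        s = max(0, min(N - 1, s + policy[s]))
    # 'Retrain' on the aggregated dataset (here: tabular overwrite)
    for state, action in dataset:
        policy[state] = action

print(policy)  # recovery actions learned even for off-path states
```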

Instead of training on the single best action from its search (a one-hot label), AlphaGo's policy network learns to imitate the entire probability distribution of moves from MCTS. This 'soft label' contains far more information, enabling a much more effective and sample-efficient form of knowledge distillation.
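
To see the difference concretely, compare the cross-entropy gradients under a one-hot target versus the full visit distribution (the visit counts here are invented):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

visits = np.array([420, 280, 80, 15, 5], dtype=float)  # MCTS visit counts
soft = visits / visits.sum()        # full search distribution
hard = np.eye(5)[visits.argmax()]   # one-hot: only the single best move

logits = np.zeros(5)                # untrained policy for this state
probs = softmax(logits)

# Cross-entropy gradient w.r.t. logits is probs - target in both cases.
print("soft-label grad:", np.round(probs - soft, 3))
print("hard-label grad:", np.round(probs - hard, 3))
# The soft gradient pushes probability toward EVERY decent move in
# proportion to its search-estimated merit, so each position teaches
# the network the whole move ranking, not just the argmax.
```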

Monte Carlo Tree Search (MCTS) acts as a 'policy improvement operator.' After the search finds a better move distribution, the policy network is trained to directly predict this improved distribution. This distills the expensive search process into the network itself, making it stronger over time.
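
A minimal sketch of that loop, reusing a toy stand-in for search: distill the searched distribution into the network, then search again from the now-stronger network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
logits = rng.normal(size=5)                      # raw policy network
quality = np.array([0.0, 1.0, 0.2, -0.5, 0.4])   # hidden true move values

def improve(logits, strength=1.0):
    # Stand-in for MCTS as a policy-improvement operator: the searched
    # distribution is always sharper toward good moves than the prior.
    return softmax(logits + strength * quality)

for _ in range(300):                  # search -> distill -> repeat
    target = improve(logits)          # improved distribution from 'search'
    logits -= 0.1 * (softmax(logits) - target)  # cross-entropy step

print(np.round(softmax(logits), 3))   # raw net now plays like net + search
```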

Karpathy criticizes standard reinforcement learning as a noisy and inefficient process. It assigns credit or blame to an entire sequence of actions based on a single outcome bit (success/failure). This is like "sucking supervision through a straw," as it fails to identify which specific steps in a successful trajectory were actually correct.
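
The straw metaphor in miniature (numbers invented): a REINFORCE-style update scales every step's log-probability by the same single return.

```python
import numpy as np

rng = np.random.default_rng(3)

# A 50-step trajectory: per-step log-probs of the actions taken, and
# ONE scalar outcome for the whole thing (win = 1.0 / loss = 0.0).
logprobs = rng.normal(-1.0, 0.3, size=50)
outcome = 1.0

# REINFORCE-style objective: every action in the sequence receives
# exactly the same credit, the single outcome bit.
per_step_credit = outcome * np.ones(50)
loss = -(per_step_credit * logprobs).sum()

# Whether step 7 was a blunder rescued by step 40 is invisible here:
# one bit of supervision is smeared across all fifty decisions.
print(round(loss, 3))
```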

Humans stop analyzing a game when they intuit a winning or losing position. AlphaGo’s value function mimics this by predicting the eventual outcome from any board state. This allows the search to be drastically shortened, as it doesn't need to play out every possibility to the very end.
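
A toy version of value-based truncation, with an invented `value_net`: the search stops expanding as soon as the predicted outcome is confident enough, rather than playing every line to a terminal state.

```python
def value_net(state):
    # Stand-in for AlphaGo's value function: predicted P(win) from here.
    # In this toy, a higher state number means closer to winning.
    return max(0.0, min(1.0, 0.5 + state / 10))

def search(state, depth, cutoff=0.9):
    v = value_net(state)
    # Like a human who stops reading out a line once the position is
    # clearly won or lost, truncate when the value net is confident.
    if depth == 0 or v >= cutoff or v <= 1 - cutoff:
        return v
    # Otherwise expand: toy branching into a better and a worse position.
    return max(search(state + 1, depth - 1), search(state - 1, depth - 1))

print(search(0, depth=6))  # far fewer nodes than playing to the end
```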

On-policy reinforcement learning, where a model learns from its own generated actions and their consequences, is analogous to how humans learn from direct experience and mistakes. This contrasts with off-policy methods like supervised fine-tuning (SFT), which resemble simply imitating others' successful paths.
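
A contrived side-by-side with an invented three-action environment: SFT copies recorded expert actions, while the on-policy loop acts, observes rewards, and reinforces its own successful trials.

```python
import random

random.seed(4)
ACTIONS = ["a", "b", "c"]

def env_reward(action):
    # Toy environment: only the model's own trials reveal this.
    return 1.0 if action == "b" else 0.0

expert_data = [("s0", "b"), ("s0", "b")]  # someone else's successes

# SFT flavor: imitate the expert's recorded actions, never acting,
# never seeing the consequences of your own choices.
sft_targets = [a for _, a in expert_data]

# On-policy flavor: act with the CURRENT policy, observe the outcome,
# and reinforce what worked -- learning from your own mistakes.
policy = {a: 1.0 for a in ACTIONS}        # unnormalized preferences
for _ in range(100):
    total = sum(policy.values())
    action = random.choices(ACTIONS, [policy[a] / total for a in ACTIONS])[0]
    policy[action] += env_reward(action)  # credit your own experience

print(sft_targets, {a: round(v, 1) for a, v in policy.items()})
```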

In the endgame, AlphaGo made moves that seemed suboptimal, even giving up points. This was because it wasn't optimizing for a large victory margin (a human heuristic) but purely for maximizing the probability of winning, even by a half-point. This reveals how literal AI objective functions can differ from human proxies for success.
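
The divergence is easy to state as a two-line decision rule (probabilities and margins invented):

```python
# Two candidate endgame moves: a human heuristic (maximize expected
# margin) and AlphaGo's actual objective (maximize win probability)
# disagree about which is better.
moves = {
    "keeps every point, riskier": {"p_win": 0.90, "margin": +12.0},
    "gives up points, safe":      {"p_win": 0.99, "margin": +0.5},
}

by_margin = max(moves, key=lambda m: moves[m]["margin"])
by_p_win = max(moves, key=lambda m: moves[m]["p_win"])

print("human-style choice:", by_margin)
print("AlphaGo's choice:  ", by_p_win)  # a half-point win still counts as 1.0
```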