Humans stop analyzing a game when they intuit a winning or losing position. AlphaGo’s value function mimics this by predicting the eventual outcome from any board state. This allows the search to be drastically shortened, as it doesn't need to play out every possibility to the very end.
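A minimal sketch of that truncation, assuming a hypothetical game interface: `legal_moves`, `apply_move`, and `value_net` are illustrative stand-ins, not AlphaGo's actual components. A depth-limited leaf is scored by the value network's predicted outcome instead of being played out to the end of the game.

```python
# Hedged sketch of value-function truncation (not DeepMind's code).
# `legal_moves`, `apply_move`, and `value_net` are hypothetical stand-ins.

def evaluate(state, depth, value_net, legal_moves, apply_move):
    """Negamax search that stops early: a leaf is scored by the value
    network's predicted outcome (from the perspective of the player to
    move) rather than by playing every line to a terminal position."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return value_net(state)  # predicted outcome in [-1, 1]
    # Otherwise, back up the best child's value as usual.
    return max(-evaluate(apply_move(state, m), depth - 1,
                         value_net, legal_moves, apply_move)
               for m in moves)
```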
Go's search space contains more positions than there are atoms in the observable universe, making exhaustive search impossible. AlphaGo's core breakthrough was using neural networks to intelligently guide its search, evaluating only the most promising moves and making an intractable problem tractable.
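One hedged way to picture that pruning: a policy network assigns a prior probability to each legal move, and the search expands only the handful it rates highest. `policy_net` is a hypothetical stand-in that returns a move-to-probability mapping.

```python
import heapq

def promising_moves(state, policy_net, k=8):
    """Keep only the k moves the (hypothetical) policy network rates
    most probable, cutting Go's ~250-way branching down to a width a
    search can actually afford."""
    priors = policy_net(state)  # assumed to return {move: probability}
    return heapq.nlargest(k, priors, key=priors.get)
```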
AlphaGo's architecture mimicked human cognition by pairing a 'fast thinking' neural network for intuition with a 'slow thinking' search algorithm for explicit planning. This hybrid model, combining pattern recognition with calculation, proved more powerful for tackling complex problems than either approach alone.
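The PUCT selection rule published in the AlphaGo papers makes this pairing concrete: the policy prior P (fast intuition) biases exploration toward plausible moves, while the accumulated search value Q and visit counts N (slow calculation) take over as evidence builds. A plain-Python sketch:

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Averaged search value plus a prior-weighted exploration bonus.
    The bonus shrinks as a move is visited, so explicit calculation
    gradually overrides intuition. c_puct is an illustrative constant."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration

def select_child(children, parent_visits):
    """children: list of (move, q, prior, visits) tuples; pick the
    PUCT argmax during tree descent."""
    return max(children,
               key=lambda c: puct_score(c[1], c[2], parent_visits, c[3]))
```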
Modern LLMs use a simple form of reinforcement learning that directly rewards successful outcomes. This contrasts with more sophisticated methods, like those in AlphaGo or the brain, which use "value functions" to estimate long-term consequences. It's a mystery why the simpler approach is so effective.
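For contrast, a minimal sketch of that "simple" outcome-reward recipe in PyTorch; it reflects the textbook REINFORCE estimator, not any specific lab's implementation. The whole sampled answer receives one scalar reward, and no value function estimates long-term consequences.

```python
import torch

def outcome_reinforce_loss(logprobs, reward):
    """logprobs: (T,) log-probabilities of the sampled answer's tokens.
    reward: 1.0 if the final answer was judged correct, else 0.0.
    Every token is credited with the same terminal outcome; there is
    no learned estimate of intermediate value."""
    return -(reward * logprobs.sum())
```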
Instead of training on the single best action from its search (a one-hot label), AlphaGo's policy network learns to imitate the entire probability distribution of moves from MCTS. This 'soft label' contains far more information, enabling a much more effective and sample-efficient form of knowledge distillation.
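A minimal sketch of that soft-label loss, assuming PyTorch tensors with illustrative shapes:

```python
import torch.nn.functional as F

def policy_distillation_loss(logits, mcts_visit_probs):
    """logits: (B, num_moves) raw policy-network outputs.
    mcts_visit_probs: (B, num_moves) normalized visit counts from MCTS.
    Cross-entropy against a soft target (equal to KL divergence up to
    a constant), so the network learns the search's relative ranking
    of every move, not just its single favorite."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(mcts_visit_probs * log_probs).sum(dim=-1).mean()
```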
Monte Carlo Tree Search (MCTS) acts as a 'policy improvement operator.' After the search finds a better move distribution, the policy network is trained to directly predict this improved distribution. This distills the expensive search process into the network itself, making it stronger over time.
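A hedged outline of that loop. `run_mcts`, `play_game`, and `train_step` are hypothetical placeholders for the search, self-play, and optimizer machinery, and the game count is an invented constant.

```python
NUM_SELF_PLAY_GAMES = 500  # illustrative, not a paper value

def improvement_loop(policy_net, run_mcts, play_game, train_step, iters):
    """Each iteration: MCTS (guided by the current net) produces move
    distributions stronger than the raw policy, then the policy is
    trained to predict them, distilling the search into the network."""
    for _ in range(iters):
        examples = []
        for _ in range(NUM_SELF_PLAY_GAMES):
            # play_game is assumed to return a list of
            # (state, mcts_move_probs, final_outcome) tuples.
            examples.extend(play_game(policy_net, run_mcts))
        for state, mcts_probs, outcome in examples:
            # Supervised step: imitate the search's improved distribution.
            train_step(policy_net, state, mcts_probs, outcome)
    return policy_net
```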
Google DeepMind CEO Demis Hassabis argues that today's large models are insufficient for AGI. He believes progress requires reintroducing algorithmic techniques from systems like AlphaGo, specifically planning and search, to enable more robust reasoning and problem-solving capabilities beyond simple pattern matching.
A key insight from AlphaGo is that a relatively shallow neural network can approximate the result of an incredibly deep and complex search tree. This suggests neural nets can learn to compress sequential, recursive computation into a single, efficient forward pass.
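A toy illustration of that compression claim (not AlphaGo's actual setup): an expensive recursive search labels positions offline, and a small network is fit to the state-to-value mapping so that one forward pass reproduces the search's verdict.

```python
import torch
import torch.nn as nn

def fit_value_net(states, targets, epochs=200):
    """states: (N, 64) float tensor of encoded positions (shape is
    illustrative); targets: (N,) values produced by a slow deep search
    run offline. After fitting, net(state) approximates the whole
    search tree's output at the cost of a single forward pass."""
    net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(states).squeeze(-1), targets)
        loss.backward()
        opt.step()
    return net
```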
Unlike typical reinforcement learning, which learns from sparse win/loss signals, AlphaGo's method is remarkably stable. It uses MCTS to generate an 'improved' move distribution for every state, turning the problem into a simple supervised learning task of imitating a better version of itself and avoiding high-variance gradients.
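A shape-level sketch of why the variance differs, with hypothetical helper names: the raw win/loss signal broadcasts one noisy scalar over the entire game, while MCTS supplies a full distribution at every state.

```python
def reinforce_targets(num_moves_in_game, won):
    """One noisy +/-1 scalar, shared by every move in the game."""
    return [1.0 if won else -1.0] * num_moves_in_game

def mcts_targets(per_state_visit_counts):
    """A full, low-variance move distribution for every single state."""
    return [normalize(counts) for counts in per_state_visit_counts]

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]
```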
In the endgame, AlphaGo made moves that seemed suboptimal, even giving up points. This was because it wasn't optimizing for a large victory margin (a human heuristic) but purely for maximizing the probability of winning, even by a half-point. This reveals how literal AI objective functions can differ from human proxies for success.
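A toy illustration with invented numbers: given per-move estimates, a margin-maximizer and a win-probability-maximizer can disagree, and AlphaGo's objective takes the 'safe half-point' line.

```python
candidates = {
    "aggressive": {"win_prob": 0.88, "expected_margin": 12.5},
    "safe":       {"win_prob": 0.97, "expected_margin": 0.5},
}

human_style   = max(candidates, key=lambda m: candidates[m]["expected_margin"])
alphago_style = max(candidates, key=lambda m: candidates[m]["win_prob"])

print(human_style)    # "aggressive" -- chases points
print(alphago_style)  # "safe"       -- maximizes P(win), even by half a point
```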
The "temporal difference" algorithm, which tracks changing expectations, isn't just a theoretical model. It is biologically installed in brains via dopamine. This same algorithm was externalized by DeepMind to create a world-champion Go-playing AI, representing a unique instance of biology directly inspiring a major technological breakthrough.