Eric Jang – Building AlphaGo from scratch
Episode: 157 min
Read time: 3 min
AI-Generated Summary
Key Takeaways
- ✓MCTS Four-Step Loop: Monte Carlo Tree Search operates as a four-step cycle of selection, expansion, evaluation, and backup, repeated across hundreds to thousands of simulations per move. Selection uses the PUCT formula: the Q-value plus an exploration bonus that scales with the move's prior probability and shrinks as its visit count grows. Each simulation grows the tree by one node, evaluates it with the value network, then propagates the result back to the root. In the AlphaGo Lee matches, tens of thousands of simulations ran per move; modern training requires far fewer.
- ✓Policy Distillation as the Core Training Signal: AlphaGo's self-improvement mechanism works by treating MCTS output as a superior label for the policy network. After search produces a sharper action distribution than the raw network's initial guess, the network is trained to predict that refined distribution directly. This shifts computational burden from search into the network weights over successive training iterations, meaning each generation starts from a stronger baseline before applying additional simulations on top.
- ✓Value Network Bootstrapping Strategy: Training the value network accurately before investing compute in MCTS is critical — running search on inaccurate value estimates produces worse distributions than the raw policy alone. Jang recommends initializing with expert human game data or open-source bot self-play to establish reliable late-game value estimates first. On small boards like 9x9, even random-agent games generate enough realistic end-states to bootstrap a usable value function before scaling to 19x19.
- ✓ResNets Outperform Transformers at Low Budget: For small-data Go training regimes, residual convolutional networks outperform transformers because the local inductive bias of convolutions matches Go's spatially structured patterns. Transformers require more data to learn local invariances from scratch but offer better global board context once data is sufficient. KataGo found it useful to pool global features throughout the network to connect value across distant board regions, a hybrid approach that partially bridges the gap between the two architectures.
- ✓MCTS vs. Model-Free RL Variance: Naive policy gradient RL applied to Go suffers from extreme gradient variance because the win/loss signal is diluted across roughly 300 moves per game. With two evenly matched agents playing 100 games, only one or two moves may genuinely differentiate the winner, yet all 30,000 moves (100 × 300) receive training signal. MCTS bypasses this credit assignment problem entirely by generating a strictly better action label for every single move, not just rewarding winning trajectories after the fact.
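The selection rule and the distillation target described in the takeaways above can be sketched in a few lines of Python. This is a minimal illustration, not the episode's code: the function names, the dictionary layout for child statistics, and the constant `c_puct=1.5` are all assumptions made for the example, and the bonus term follows the commonly published AlphaGo-style form (prior times the square root of parent visits, divided by one plus child visits).

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Q-value plus an exploration bonus (AlphaGo-style PUCT).

    c_puct is an assumed exploration constant, chosen for illustration.
    """
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + bonus

def select_child(children, c_puct=1.5):
    """Selection step: pick the move with the highest PUCT score.

    children maps move -> {"q": float, "prior": float, "visits": int}
    (a hypothetical layout for this sketch).
    """
    parent_visits = sum(c["visits"] for c in children.values())
    return max(
        children,
        key=lambda m: puct_score(
            children[m]["q"], children[m]["prior"],
            parent_visits, children[m]["visits"], c_puct,
        ),
    )

def visit_policy(children, temperature=1.0):
    """Turn root visit counts into the distribution used as the policy label."""
    weights = {m: c["visits"] ** (1.0 / temperature) for m, c in children.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def cross_entropy(target, predicted, eps=1e-12):
    """Distillation loss: how far the raw policy is from the search result."""
    return -sum(p * math.log(predicted[m] + eps) for m, p in target.items())

# Toy root node with three candidate moves.
root = {
    "A": {"q": 0.1, "prior": 0.6, "visits": 10},
    "B": {"q": 0.3, "prior": 0.3, "visits": 2},
    "C": {"q": 0.0, "prior": 0.1, "visits": 0},
}
next_move = select_child(root)        # move explored by the next simulation
target = visit_policy(root)           # label the policy network is trained on
raw_policy = {m: c["prior"] for m, c in root.items()}
loss = cross_entropy(target, raw_policy)
```

In this toy root, move B has few visits but a good Q-value, so the exploration bonus steers the next simulation toward it. The visit-count distribution then serves as the training label, and minimizing the cross-entropy pulls the raw policy toward what search found.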
What It Covers
Eric Jang, former VP of AI at 1X Technologies and Google DeepMind robotics researcher, rebuilds AlphaGo from scratch while on sabbatical. He explains Monte Carlo Tree Search, policy and value networks, and self-play training loops, and shows how a 10-layer neural network amortizes a search problem once considered computationally intractable: a game tree with more states than there are atoms in the universe.
Key Questions Answered
- •KataGo's 40x Compute Reduction: David Wu's open-source KataGo project, released around 2020, achieved approximately a 40x reduction in compute required to train a top-tier Go bot compared to earlier systems. Key contributions included multi-board-size training (transferring value representations from 9x9 to 19x19), global feature pooling in the network architecture, and refined self-play data pipelines. LLM-assisted coding now makes replicating and extending this work achievable for a few thousand dollars of rented compute rather than millions.
- •Neural Networks Compressing NP-Hard Search: A 10-layer neural network with roughly 3 million parameters can approximate the output of an exhaustive game-tree search spanning more states than atoms in the universe. This compression works because Go — like protein folding — has macroscopic structure: predicting who wins is far more tractable than predicting exact board states 100 moves ahead. The value function targets a smooth, averaged quantity over chaotic futures rather than precise trajectory prediction, making it learnable despite underlying combinatorial complexity.
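To give a sense of scale for the "roughly 3 million parameters" figure, the arithmetic below counts convolution weights for one hypothetical network. The summary does not state the actual architecture, so every number here is an assumption chosen for illustration: 10 residual blocks of two 3x3 convolutions each, 128 channels, and 17 input feature planes. Batch-norm parameters and the policy and value heads are ignored.

```python
def conv_params(in_ch, out_ch, kernel=3):
    """Weight count of one bias-free 2D convolution."""
    return in_ch * out_ch * kernel * kernel

def resnet_tower_params(blocks=10, channels=128, input_planes=17):
    """Conv weights in a stem plus a tower of residual blocks.

    All defaults are assumed values, not figures from the episode.
    """
    stem = conv_params(input_planes, channels)            # 17 * 128 * 9
    tower = blocks * 2 * conv_params(channels, channels)  # two convs per block
    return stem + tower

total = resnet_tower_params()
print(f"{total:,} conv weights")  # prints "2,968,704 conv weights"
```

Under these assumptions the tower comes to just under 3 million weights, a model small enough to evaluate in a single fast forward pass.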
Notable Moment
Jang argues that AlphaGo's most underappreciated result is not beating a world champion but demonstrating that a small neural network can compress what appears to be an NP-hard search into a single forward pass. He connects this to AlphaFold and raises the possibility that worst-case computational complexity theory may be incomplete when applied to structured real-world problems.