Skip to main content
Dwarkesh Podcast

Eric Jang – Building AlphaGo from scratch

157 min episode · 3 min read
·

Episode

157 min

Read time

3 min

AI-Generated Summary

Key Takeaways

  • MCTS Four-Step Loop: Monte Carlo Tree Search operates as a four-step cycle — selection, expansion, evaluation, backup — repeated across hundreds to thousands of simulations per move. Selection uses the PUCT formula (Q-value plus exploration bonus scaled by prior probability divided by visit count). Each simulation grows the tree one node, evaluates it with the value network, then propagates results back to the root. In AlphaGo Lee matches, tens of thousands of simulations ran per move; modern training requires far fewer.
  • Policy Distillation as the Core Training Signal: AlphaGo's self-improvement mechanism works by treating MCTS output as a superior label for the policy network. After search produces a sharper action distribution than the raw network's initial guess, the network is trained to predict that refined distribution directly. This shifts computational burden from search into the network weights over successive training iterations, meaning each generation starts from a stronger baseline before applying additional simulations on top.
  • Value Network Bootstrapping Strategy: Training the value network accurately before investing compute in MCTS is critical — running search on inaccurate value estimates produces worse distributions than the raw policy alone. Jang recommends initializing with expert human game data or open-source bot self-play to establish reliable late-game value estimates first. On small boards like 9x9, even random-agent games generate enough realistic end-states to bootstrap a usable value function before scaling to 19x19.
  • ResNets Outperform Transformers at Low Budget: For small-data Go training regimes, residual convolutional networks outperform transformers because local convolutional inductive bias matches Go's spatially structured patterns. Transformers require more data to learn local invariances from scratch but offer better global board context once data is sufficient. Katago found it useful to pool global features throughout the network to connect value across distant board regions — a hybrid approach that partially bridges the gap between architectures.
  • MCTS vs. Model-Free RL Variance: Naive policy gradient RL applied to Go suffers from extreme gradient variance because the win/loss signal is diluted across 300 moves per game. With two evenly matched agents playing 100 games, only one or two moves may genuinely differentiate the winner, yet all 30,000 moves receive training signal. MCTS bypasses this credit assignment problem entirely by generating a strictly better action label for every single move, not just rewarding winning trajectories after the fact.

What It Covers

Eric Jang, former VP of AI at 1X Technologies and Google DeepMind robotics researcher, rebuilds AlphaGo from scratch on sabbatical, explaining Monte Carlo Tree Search, policy and value networks, self-play training loops, and how a 10-layer neural network amortizes what was considered a computationally intractable search problem across a game tree exceeding the number of atoms in the universe.

Key Questions Answered

  • MCTS Four-Step Loop: Monte Carlo Tree Search operates as a four-step cycle — selection, expansion, evaluation, backup — repeated across hundreds to thousands of simulations per move. Selection uses the PUCT formula (Q-value plus exploration bonus scaled by prior probability divided by visit count). Each simulation grows the tree one node, evaluates it with the value network, then propagates results back to the root. In AlphaGo Lee matches, tens of thousands of simulations ran per move; modern training requires far fewer.
  • Policy Distillation as the Core Training Signal: AlphaGo's self-improvement mechanism works by treating MCTS output as a superior label for the policy network. After search produces a sharper action distribution than the raw network's initial guess, the network is trained to predict that refined distribution directly. This shifts computational burden from search into the network weights over successive training iterations, meaning each generation starts from a stronger baseline before applying additional simulations on top.
  • Value Network Bootstrapping Strategy: Training the value network accurately before investing compute in MCTS is critical — running search on inaccurate value estimates produces worse distributions than the raw policy alone. Jang recommends initializing with expert human game data or open-source bot self-play to establish reliable late-game value estimates first. On small boards like 9x9, even random-agent games generate enough realistic end-states to bootstrap a usable value function before scaling to 19x19.
  • ResNets Outperform Transformers at Low Budget: For small-data Go training regimes, residual convolutional networks outperform transformers because local convolutional inductive bias matches Go's spatially structured patterns. Transformers require more data to learn local invariances from scratch but offer better global board context once data is sufficient. Katago found it useful to pool global features throughout the network to connect value across distant board regions — a hybrid approach that partially bridges the gap between architectures.
  • MCTS vs. Model-Free RL Variance: Naive policy gradient RL applied to Go suffers from extreme gradient variance because the win/loss signal is diluted across 300 moves per game. With two evenly matched agents playing 100 games, only one or two moves may genuinely differentiate the winner, yet all 30,000 moves receive training signal. MCTS bypasses this credit assignment problem entirely by generating a strictly better action label for every single move, not just rewarding winning trajectories after the fact.
  • KataGo's 40x Compute Reduction: David Wu's open-source KataGo project, released around 2020, achieved approximately a 40x reduction in compute required to train a top-tier Go bot compared to earlier systems. Key contributions included multi-board-size training (transferring value representations from 9x9 to 19x19), global feature pooling in the network architecture, and refined self-play data pipelines. LLM-assisted coding now makes replicating and extending this work achievable for a few thousand dollars of rented compute rather than millions.
  • Neural Networks Compressing NP-Hard Search: A 10-layer neural network with roughly 3 million parameters can approximate the output of an exhaustive game-tree search spanning more states than atoms in the universe. This compression works because Go — like protein folding — has macroscopic structure: predicting who wins is far more tractable than predicting exact board states 100 moves ahead. The value function targets a smooth, averaged quantity over chaotic futures rather than precise trajectory prediction, making it learnable despite underlying combinatorial complexity.

Notable Moment

Jang argues that AlphaGo's most underappreciated result is not beating a world champion but demonstrating that a small neural network can compress what appears to be an NP-hard search into a single forward pass. He connects this to AlphaFold and raises the possibility that worst-case computational complexity theory may be incomplete when applied to structured real-world problems.

Know someone who'd find this useful?

You just read a 3-minute summary of a 154-minute episode.

Get Dwarkesh Podcast summarized like this every Monday — plus up to 2 more podcasts, free.

Pick Your Podcasts — Free
Part of this week's recap (May 11 – May 17)

Keep Reading

More from Dwarkesh Podcast

We summarize every new episode. Want them in your inbox?

Similar Episodes

Related episodes from other podcasts

You're clearly into Dwarkesh Podcast.

Every Monday, we deliver AI summaries of the latest episodes from Dwarkesh Podcast and 192+ other podcasts. Free for up to 3 shows.

Start My Monday Digest

No credit card · Unsubscribe anytime