Dwarkesh Podcast

Eric Jang – Building AlphaGo from scratch

May 15, 2026

157 min episode · 3 min read

Topics

Investing, Startups, Fundraising & VC

AI-Generated Summary

Key Takeaways

✓ MCTS Four-Step Loop: Monte Carlo Tree Search repeats a four-step cycle (selection, expansion, evaluation, backup) for hundreds to thousands of simulations per move. Selection follows the PUCT rule: pick the child that maximizes its Q-value plus an exploration bonus that grows with the policy network's prior probability for that move and shrinks as the child's visit count rises. Each simulation adds one node to the tree, evaluates it with the value network, then propagates the result back up to the root. In the AlphaGo Lee matches, tens of thousands of simulations ran per move; modern training requires far fewer.
✓ Policy Distillation as the Core Training Signal: AlphaGo's self-improvement mechanism treats the MCTS output as a superior label for the policy network. Search produces a sharper action distribution than the raw network's initial guess, and the network is then trained to predict that refined distribution directly. Over successive training iterations this shifts the computational burden from search into the network weights, so each generation starts from a stronger baseline before additional simulations are applied on top.
✓ Value Network Bootstrapping Strategy: Training the value network accurately before investing compute in MCTS is critical, because search run on inaccurate value estimates produces worse move distributions than the raw policy alone. Jang recommends initializing with expert human game data or open-source bot self-play to establish reliable late-game value estimates first. On small boards like 9x9, even random-agent games generate enough realistic end-states to bootstrap a usable value function before scaling to 19x19.
✓ ResNets Outperform Transformers at Low Budget: For small-data Go training regimes, residual convolutional networks outperform transformers because the local inductive bias of convolutions matches Go's spatially structured patterns. Transformers require more data to learn local invariances from scratch but offer better global board context once data is sufficient. KataGo found it useful to pool global features throughout the network to connect value across distant board regions, a hybrid approach that partially bridges the gap between the two architectures.
✓ MCTS vs. Model-Free RL Variance: Naive policy gradient RL applied to Go suffers from extreme gradient variance because the win/loss signal is diluted across roughly 300 moves per game. With two evenly matched agents playing 100 games, only one or two moves may genuinely differentiate the winner, yet all 30,000 moves (100 games × 300 moves) receive training signal. MCTS sidesteps this credit assignment problem by producing a search-improved action label for every single move, rather than only rewarding winning trajectories after the fact.
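The selection and backup steps from the MCTS takeaway can be sketched in a few lines of Python. This is a minimal illustration, not the code from the episode: the `Node` class, the function names, and the constant `c_puct = 1.5` are hypothetical, and the exploration term uses the commonly published AlphaGo Zero form of PUCT (prior times the square root of the parent's visits, divided by one plus the child's visits).

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float            # policy network probability for the move into this node
    visit_count: int = 0    # number of simulations that passed through this node
    value_sum: float = 0.0  # backed-up values, from the view of the player who moved here
    children: dict = field(default_factory=dict)

    def q_value(self) -> float:
        # Mean backed-up value; unvisited nodes default to 0 (an assumption).
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct_score(parent: Node, child: Node, c_puct: float = 1.5) -> float:
    # Exploration bonus: large for high-prior, rarely visited children.
    bonus = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q_value() + bonus


def select_child(parent: Node, c_puct: float = 1.5):
    # Selection step: return the (move, node) pair with the highest PUCT score.
    return max(parent.children.items(),
               key=lambda kv: puct_score(parent, kv[1], c_puct))


def backup(path, value: float) -> None:
    # Backup step: `value` is from the view of the player who moved into the
    # last node of `path`; the sign flips at every level because players alternate.
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        value = -value
```

With a well-explored move and an unvisited one under the same parent, the unvisited move can win selection on its exploration bonus alone, which is what keeps the search from collapsing onto its first guess.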

What It Covers

Eric Jang, former VP of AI at 1X Technologies and Google DeepMind robotics researcher, rebuilds AlphaGo from scratch while on sabbatical. He explains Monte Carlo Tree Search, policy and value networks, and self-play training loops, and shows how a 10-layer neural network amortizes a search problem once considered computationally intractable, over a game tree larger than the number of atoms in the universe.
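As a rough sanity check on the "atoms in the universe" comparison, the arithmetic below counts every way to mark each of the 361 points on a 19x19 board as empty, black, or white. This is an upper bound on board configurations (many are illegal), and the game tree itself is larger still; the figure of about 10^80 atoms is a commonly quoted order-of-magnitude estimate, used here as an assumption. The function name is hypothetical.

```python
def board_configuration_upper_bound(size: int = 19) -> int:
    # Each point is empty, black, or white, so there are 3 choices per point.
    # This overcounts: it includes configurations that are illegal in Go.
    return 3 ** (size * size)


ATOMS_ESTIMATE = 10 ** 80  # assumed order-of-magnitude estimate, not an exact count

upper_bound = board_configuration_upper_bound()
digits = len(str(upper_bound))           # number of decimal digits in 3**361
exceeds_atoms = upper_bound > ATOMS_ESTIMATE
```

Even this crude count has 173 decimal digits, so it exceeds the atom estimate by roughly ninety orders of magnitude.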

Key Questions Answered

• KataGo's 40x Compute Reduction: David Wu's open-source KataGo project, released around 2020, achieved approximately a 40x reduction in compute required to train a top-tier Go bot compared to earlier systems. Key contributions included multi-board-size training (transferring value representations from 9x9 to 19x19), global feature pooling in the network architecture, and refined self-play data pipelines. LLM-assisted coding now makes replicating and extending this work achievable for a few thousand dollars of rented compute rather than millions.
• Neural Networks Compressing NP-Hard Search: A 10-layer neural network with roughly 3 million parameters can approximate the output of an exhaustive game-tree search spanning more states than atoms in the universe. This compression works because Go, like protein folding, has macroscopic structure: predicting who wins is far more tractable than predicting exact board states 100 moves ahead. The value function targets a smooth, averaged quantity over chaotic futures rather than precise trajectory prediction, making it learnable despite underlying combinatorial complexity.
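The policy distillation idea from the takeaways (train the network to predict the search's refined move distribution) can be illustrated with a small, dependency-free Python sketch. It assumes the search result is summarized as visit counts at the root and that the training loss is a cross-entropy between that visit distribution and the network's softmax output. The function names and the `temperature` parameter are illustrative assumptions, not details confirmed in the episode.

```python
import math


def visit_distribution(visit_counts, temperature: float = 1.0):
    # Turn root visit counts into a probability distribution (the training label).
    # Assumes at least one visit; lower temperatures sharpen the distribution.
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]


def softmax(logits):
    # Numerically stable softmax over raw policy-network outputs.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_entropy(target, logits) -> float:
    # Distillation loss: small when the network's distribution matches the search's.
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Example: search visited three candidate moves 10, 30, and 60 times.
target = visit_distribution([10, 30, 60])
uniform_loss = cross_entropy(target, [0.0, 0.0, 0.0])
matched_loss = cross_entropy(target, [math.log(p) for p in target])
```

In this example the loss is lower when the logits reproduce the search distribution than when the network is uniform, which is the gradient signal that moves search results into the weights.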

Notable Moment

Jang argues that AlphaGo's most underappreciated result is not beating a world champion but demonstrating that a small neural network can compress what appears to be an NP-hard search into a single forward pass. He connects this to AlphaFold and raises the possibility that worst-case computational complexity theory may be incomplete when applied to structured real-world problems.
