
Eric Jang – Building AlphaGo from scratch
Dwarkesh PodcastAI Summary
→ WHAT IT COVERS Eric Jang, former VP of AI at 1X Technologies and Google DeepMind robotics researcher, rebuilds AlphaGo from scratch on sabbatical, explaining Monte Carlo Tree Search, policy and value networks, self-play training loops, and how a 10-layer neural network amortizes what was considered a computationally intractable search problem across a game tree exceeding the number of atoms in the universe. → KEY INSIGHTS - **MCTS Four-Step Loop:** Monte Carlo Tree Search operates as a four-step cycle — selection, expansion, evaluation, backup — repeated across hundreds to thousands of simulations per move. Selection uses the PUCT formula (Q-value plus exploration bonus scaled by prior probability divided by visit count). Each simulation grows the tree one node, evaluates it with the value network, then propagates results back to the root. In AlphaGo Lee matches, tens of thousands of simulations ran per move; modern training requires far fewer. - **Policy Distillation as the Core Training Signal:** AlphaGo's self-improvement mechanism works by treating MCTS output as a superior label for the policy network. After search produces a sharper action distribution than the raw network's initial guess, the network is trained to predict that refined distribution directly. This shifts computational burden from search into the network weights over successive training iterations, meaning each generation starts from a stronger baseline before applying additional simulations on top. - **Value Network Bootstrapping Strategy:** Training the value network accurately before investing compute in MCTS is critical — running search on inaccurate value estimates produces worse distributions than the raw policy alone. Jang recommends initializing with expert human game data or open-source bot self-play to establish reliable late-game value estimates first. On small boards like 9x9, even random-agent games generate enough realistic end-states to bootstrap a usable value function before scaling to 19x19. - **ResNets Outperform Transformers at Low Budget:** For small-data Go training regimes, residual convolutional networks outperform transformers because local convolutional inductive bias matches Go's spatially structured patterns. Transformers require more data to learn local invariances from scratch but offer better global board context once data is sufficient. Katago found it useful to pool global features throughout the network to connect value across distant board regions — a hybrid approach that partially bridges the gap between architectures. - **MCTS vs. Model-Free RL Variance:** Naive policy gradient RL applied to Go suffers from extreme gradient variance because the win/loss signal is diluted across 300 moves per game. With two evenly matched agents playing 100 games, only one or two moves may genuinely differentiate the winner, yet all 30,000 moves receive training signal. MCTS bypasses this credit assignment problem entirely by generating a strictly better action label for every single move, not just rewarding winning trajectories after the fact. - **KataGo's 40x Compute Reduction:** David Wu's open-source KataGo project, released around 2020, achieved approximately a 40x reduction in compute required to train a top-tier Go bot compared to earlier systems. Key contributions included multi-board-size training (transferring value representations from 9x9 to 19x19), global feature pooling in the network architecture, and refined self-play data pipelines. LLM-assisted coding now makes replicating and extending this work achievable for a few thousand dollars of rented compute rather than millions. - **Neural Networks Compressing NP-Hard Search:** A 10-layer neural network with roughly 3 million parameters can approximate the output of an exhaustive game-tree search spanning more states than atoms in the universe. This compression works because Go — like protein folding — has macroscopic structure: predicting who wins is far more tractable than predicting exact board states 100 moves ahead. The value function targets a smooth, averaged quantity over chaotic futures rather than precise trajectory prediction, making it learnable despite underlying combinatorial complexity. → NOTABLE MOMENT Jang argues that AlphaGo's most underappreciated result is not beating a world champion but demonstrating that a small neural network can compress what appears to be an NP-hard search into a single forward pass. He connects this to AlphaFold and raises the possibility that worst-case computational complexity theory may be incomplete when applied to structured real-world problems. 💼 SPONSORS [{"name": "Cursor", "url": "https://cursor.com/dwarkesh"}, {"name": "Jane Street", "url": "https://janestreet.com/tourcash"}] 🏷️ AlphaGo, Monte Carlo Tree Search, Reinforcement Learning, Self-Play Training, Neural Network Architecture, Computational Complexity, Game AI