Latent Space

Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2

92 min episode · 3 min read

AI-Generated Summary

Key Takeaways

  • On-Policy vs Off-Policy RL: On-policy learning means models generate their own outputs and train on those trajectories with reward signals, similar to humans making mistakes and learning from feedback. Off-policy learning mimics other models' successful trajectories through supervised fine-tuning. On-policy proves more generalizable because models learn from first principles rather than copying, though the science comparing both approaches remains incomplete. This philosophical shift drives modern LLM reinforcement learning (a minimal sketch of the two regimes follows this list).
  • IMO Gold Training Process: Four co-captains across London, Mountain View, and Singapore trained the IMO model in approximately one week, coordinating across time zones by passing work between shifts. The team abandoned AlphaProof's specialized symbolic system for pure end-to-end Gemini, betting that general-purpose models with sufficient parameters can subsume specialized tools. The live competition format meant no benchmark climbing—teams received problems in real-time and ran inference immediately, creating genuine uncertainty about achieving gold.
  • Learning Rate Adjustment Strategy: When encountering one counterexample that breaks your worldview after ten years of consistent experience, update your beliefs by 20-50%, not 2%. Most people update too slowly when proven wrong, being overly Bayesian when they should recognize their entire prior model is invalid. This applies to both human learning and AI research—when Stable Diffusion emerged, it required complete mental model revision, not incremental adjustment.
  • Reasoning as RL Post-Training: Reasoning now means using reinforcement learning and post-training to elicit better thinking capabilities from models. It encompasses chain-of-thought tokens, thinking traces, and trajectory optimization. The technical definition: making models better with thinking through RL, whether discrete token thinking or latent thinking in vector space. This represents a shift from viewing reasoning as mysterious to treating it as systematic capability improvement through specific training methods (a toy verifiable-reward check is sketched after this list).
  • Data Efficiency Bottleneck: Humans achieve vastly higher data efficiency than models—a two-year-old recognizes dogs from three examples while models need thousands. The solution involves spending more FLOPs per token to extract maximum learning from limited data, not just adding more data. With internet data becoming finite, research focuses on better learning algorithms and architectures that can squeeze more capability from each training token, potentially through world model approaches or improved backpropagation methods (see the back-of-the-envelope compute arithmetic after this list).
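
To make the on-policy vs off-policy contrast concrete, here is a minimal sketch using a toy one-step policy over five discrete actions. The setup, the reward, and the function names (sft_update, reinforce_update) are illustrative assumptions, not details from the episode.

```python
# Toy contrast between off-policy imitation (SFT) and on-policy RL (REINFORCE).
# Assumptions: a single-step "trajectory" over 5 discrete actions, a scalar
# reward, and a plain softmax policy parameterized by logits.
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 5
logits = np.zeros(NUM_ACTIONS)           # the student policy's parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sft_update(logits, teacher_action, lr=0.5):
    """Off-policy: imitate a trajectory produced by another (teacher) model.
    The cross-entropy gradient pushes probability toward the teacher's choice."""
    probs = softmax(logits)
    grad = -probs
    grad[teacher_action] += 1.0          # gradient of log p(teacher_action)
    return logits + lr * grad

def reinforce_update(logits, reward_fn, lr=0.5):
    """On-policy: sample from the *current* policy, score the sample with a
    reward, and reinforce it in proportion to that reward."""
    probs = softmax(logits)
    action = rng.choice(NUM_ACTIONS, p=probs)
    reward = reward_fn(action)
    grad = -probs
    grad[action] += 1.0
    return logits + lr * reward * grad   # reward-weighted log-prob gradient

# Reward is 1 only for action 3; the teacher (for SFT) would also pick action 3.
reward_fn = lambda a: 1.0 if a == 3 else 0.0
for _ in range(200):
    logits = reinforce_update(logits, reward_fn)   # or: sft_update(logits, 3)
print(softmax(logits).round(2))          # probability mass concentrates on action 3
```

In the off-policy variant the update direction comes from the teacher's choice regardless of what the student would have sampled; in the on-policy variant the model only learns from trajectories it actually produced, which is the distinction the episode emphasizes.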
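
For the reasoning bullet, here is a toy version of the verifiable-reward idea commonly used in RL post-training: the thinking trace is elicited but not graded, and only the final answer earns reward. The <think> tag format and the function names are assumptions for illustration, not a description of Gemini's internals.

```python
# Sketch of a verifiable-reward check for reasoning-style RL post-training.
# Assumption: the model wraps its chain of thought in <think>...</think> and
# puts the final answer after the closing tag; only the answer is rewarded.
import re

def extract_answer(trajectory: str) -> str:
    """Drop the thinking trace and keep whatever follows </think>."""
    answer = re.split(r"</think>", trajectory, maxsplit=1)[-1]
    return answer.strip()

def reward(trajectory: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches, else 0.0.
    The thinking tokens themselves are not graded, only elicited."""
    return 1.0 if extract_answer(trajectory) == ground_truth.strip() else 0.0

sample = "<think>7 * 6 = 42, minus 2 is 40</think>40"
print(reward(sample, "40"))   # 1.0 -> this trajectory would be reinforced
```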
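
For the data-efficiency bullet, a back-of-the-envelope sketch of what "more FLOPs per token" can mean under a fixed token budget, using the common ~6 x params x tokens approximation for training compute. All concrete numbers are made up for illustration.

```python
# Back-of-the-envelope: with a fixed token budget, compute per token can still
# grow, via more parameters or more passes (epochs) over the same data.
# Uses the common ~6 * params * tokens approximation for training FLOPs;
# all concrete numbers below are illustrative, not from the episode.

def train_flops(params: float, tokens: float, epochs: int = 1) -> float:
    return 6.0 * params * tokens * epochs

TOKENS = 10e12                                # a fixed (finite) 10T-token dataset
base   = train_flops(7e9,  TOKENS)            # 7B model, single pass
bigger = train_flops(70e9, TOKENS)            # 10x params -> 10x FLOPs per token
repass = train_flops(7e9,  TOKENS, epochs=4)  # same model, 4 passes over the data

for name, f in [("base", base), ("10x params", bigger), ("4 epochs", repass)]:
    print(f"{name:10s} {f:.2e} FLOPs  ({f / TOKENS:.1e} per token)")
```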

What It Covers

Yi Tay, a co-captain of Google DeepMind's IMO Gold effort, discusses the decision to abandon AlphaProof's specialized system in favor of an end-to-end Gemini model, the shift from off-policy to on-policy reinforcement learning, training the model in roughly one week with co-captains handing off work across three time zones, and establishing DeepMind's Reasoning and AGI team in Singapore, while reflecting on architecture debates and data efficiency challenges.

Key Questions Answered

  • AI Coding Productivity Gains: Models now fix bugs without human inspection—paste error messages into tools like Anthropic's Claude, apply suggested fixes, and relaunch jobs. This surpasses basic "vibe coding" where you know the solution but want automation. Models investigate and solve problems faster than manual debugging would, sometimes saving full workdays. The shift represents moving from AI as occasional helper to AI as automatic debugging partner for high-expertise ML work.
  • Geographic Distribution Strategy: DeepMind Singapore operates with 24-hour coverage across London, Mountain View, and Singapore time zones, enabling continuous model training and debugging. Geography matters for talent density and cultural diversity—some researchers refuse Bay Area relocation but join Singapore or London teams. The region provides "far enough for peace and quiet, close enough to stay connected" while accessing strong local talent pools attracted by frontier AGI research opportunities.

Notable Moment

Tay reveals that when the IMO model checkpoint was delivered for the live competition, he knew nothing about IMO mathematics and couldn't solve the problems himself. He watched human participants' scores to determine if Gemini would achieve gold, since the medal threshold depends on a bell curve of human performance. The outcome remained genuinely uncertain until results arrived, creating real adrenaline unlike standard benchmark climbing.
