Latent Space

Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2

92 min episode · 3 min read

AI-Generated Summary

Key Takeaways

  • On-Policy vs Off-Policy RL: On-policy learning means models generate their own outputs and train on those trajectories with reward signals, similar to humans making mistakes and learning from feedback. Off-policy learning mimics other models' successful trajectories through supervised fine-tuning. On-policy proves more generalizable because models learn from first principles rather than copying, though the science comparing both approaches remains incomplete. This philosophical shift drives modern LLM reinforcement learning (a minimal sketch of the two regimes follows this list).
  • IMO Gold Training Process: Four co-captains across London, Mountain View, and Singapore trained the IMO model in approximately one week, coordinating across time zones by passing work between shifts. The team abandoned AlphaProof's specialized symbolic system for pure end-to-end Gemini, betting that general-purpose models with sufficient parameters can subsume specialized tools. The live competition format meant no benchmark climbing—teams received problems in real-time and ran inference immediately, creating genuine uncertainty about achieving gold.
  • Learning Rate Adjustment Strategy: When encountering one counterexample that breaks your worldview after ten years of consistent experience, update your beliefs by 20-50%, not 2%. Most people update too slowly when proven wrong, being overly Bayesian when they should recognize their entire prior model is invalid. This applies to both human learning and AI research—when Stable Diffusion emerged, it required complete mental model revision, not incremental adjustment.
  • Reasoning as RL Post-Training: Reasoning now means using reinforcement learning and post-training to elicit better thinking capabilities from models. It encompasses chain-of-thought tokens, thinking traces, and trajectory optimization. The technical definition: making models better with thinking through RL, whether discrete token thinking or latent thinking in vector space. This represents a shift from viewing reasoning as mysterious to treating it as systematic capability improvement through specific training methods (a toy verifiable-reward check is sketched after this list).
  • Data Efficiency Bottleneck: Humans achieve vastly higher data efficiency than models—a two-year-old recognizes dogs from three examples while models need thousands. The solution involves spending more FLOPs per token to extract maximum learning from limited data, not just adding more data. With internet data becoming finite, research focuses on better learning algorithms and architectures that can squeeze more capability from each training token, potentially through world model approaches or improved backpropagation methods (see the back-of-the-envelope compute arithmetic after this list).
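
To make the on-policy vs off-policy contrast concrete, here is a minimal sketch using a toy one-step policy over five discrete actions. The setup, the reward, and the function names (sft_update, reinforce_update) are illustrative assumptions, not details from the episode.

```python
# Toy contrast between off-policy imitation (SFT) and on-policy RL (REINFORCE).
# Assumptions: a single-step "trajectory" over 5 discrete actions, a scalar
# reward, and a plain softmax policy parameterized by logits.
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 5
logits = np.zeros(NUM_ACTIONS)           # the student policy's parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sft_update(logits, teacher_action, lr=0.5):
    """Off-policy: imitate a trajectory produced by another (teacher) model.
    The cross-entropy gradient pushes probability toward the teacher's choice."""
    probs = softmax(logits)
    grad = -probs
    grad[teacher_action] += 1.0          # gradient of log p(teacher_action)
    return logits + lr * grad

def reinforce_update(logits, reward_fn, lr=0.5):
    """On-policy: sample from the *current* policy, score the sample with a
    reward, and reinforce it in proportion to that reward."""
    probs = softmax(logits)
    action = rng.choice(NUM_ACTIONS, p=probs)
    reward = reward_fn(action)
    grad = -probs
    grad[action] += 1.0
    return logits + lr * reward * grad   # reward-weighted log-prob gradient

# Reward is 1 only for action 3; the teacher (for SFT) would also pick action 3.
reward_fn = lambda a: 1.0 if a == 3 else 0.0
for _ in range(200):
    logits = reinforce_update(logits, reward_fn)   # or: sft_update(logits, 3)
print(softmax(logits).round(2))          # probability mass concentrates on action 3
```

In the off-policy variant the update direction comes from the teacher's choice regardless of what the student would have sampled; in the on-policy variant the model only learns from trajectories it actually produced, which is the distinction the episode emphasizes.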
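
For the reasoning bullet, here is a toy version of the verifiable-reward idea commonly used in RL post-training: the thinking trace is elicited but not graded, and only the final answer earns reward. The <think> tag format and the function names are assumptions for illustration, not a description of Gemini's internals.

```python
# Sketch of a verifiable-reward check for reasoning-style RL post-training.
# Assumption: the model wraps its chain of thought in <think>...</think> and
# puts the final answer after the closing tag; only the answer is rewarded.
import re

def extract_answer(trajectory: str) -> str:
    """Drop the thinking trace and keep whatever follows </think>."""
    answer = re.split(r"</think>", trajectory, maxsplit=1)[-1]
    return answer.strip()

def reward(trajectory: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches, else 0.0.
    The thinking tokens themselves are not graded, only elicited."""
    return 1.0 if extract_answer(trajectory) == ground_truth.strip() else 0.0

sample = "<think>7 * 6 = 42, minus 2 is 40</think>40"
print(reward(sample, "40"))   # 1.0 -> this trajectory would be reinforced
```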
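
For the data-efficiency bullet, a back-of-the-envelope sketch of what "more FLOPs per token" can mean under a fixed token budget, using the common ~6 x params x tokens approximation for training compute. All concrete numbers are made up for illustration.

```python
# Back-of-the-envelope: with a fixed token budget, compute per token can still
# grow, via more parameters or more passes (epochs) over the same data.
# Uses the common ~6 * params * tokens approximation for training FLOPs;
# all concrete numbers below are illustrative, not from the episode.

def train_flops(params: float, tokens: float, epochs: int = 1) -> float:
    return 6.0 * params * tokens * epochs

TOKENS = 10e12                                # a fixed (finite) 10T-token dataset
base   = train_flops(7e9,  TOKENS)            # 7B model, single pass
bigger = train_flops(70e9, TOKENS)            # 10x params -> 10x FLOPs per token
repass = train_flops(7e9,  TOKENS, epochs=4)  # same model, 4 passes over the data

for name, f in [("base", base), ("10x params", bigger), ("4 epochs", repass)]:
    print(f"{name:10s} {f:.2e} FLOPs  ({f / TOKENS:.1e} per token)")
```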

What It Covers

Yi Tay, a co-captain of Google DeepMind's IMO Gold effort, discusses the decision to abandon AlphaProof's specialized system in favor of an end-to-end Gemini model, the shift from off-policy to on-policy reinforcement learning, training the model in roughly one week with co-captains handing off work across three time zones, and establishing DeepMind's Reasoning and AGI team in Singapore, while reflecting on architecture debates and data efficiency challenges.

Key Questions Answered

  • AI Coding Productivity Gains: Models now fix bugs without human inspection—paste error messages into tools like Anthropic's Claude, apply suggested fixes, and relaunch jobs. This surpasses basic "vibe coding" where you know the solution but want automation. Models investigate and solve problems faster than manual debugging would, sometimes saving full workdays. The shift represents moving from AI as occasional helper to AI as automatic debugging partner for high-expertise ML work.
  • Geographic Distribution Strategy: DeepMind Singapore operates with 24-hour coverage across London, Mountain View, and Singapore time zones, enabling continuous model training and debugging. Geography matters for talent density and cultural diversity—some researchers refuse Bay Area relocation but join Singapore or London teams. The region provides "far enough for peace and quiet, close enough to stay connected" while accessing strong local talent pools attracted by frontier AGI research opportunities.

Notable Moment

Tay reveals that when the IMO model checkpoint was delivered for the live competition, he knew nothing about IMO mathematics and couldn't solve the problems himself. He watched human participants' scores to determine if Gemini would achieve gold, since the medal threshold depends on a bell curve of human performance. The outcome remained genuinely uncertain until results arrived, creating real adrenaline unlike standard benchmark climbing.
