Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay
Episode: 92 min
Read time: 3 min
Topics: Economics & Policy
AI-Generated Summary
Key Takeaways
- ✓On-Policy vs Off-Policy RL: On-policy learning means models generate their own outputs, receive rewards based on those generations, then train on their own trajectories rather than imitating others' successful paths. This approach proves more generalizable than supervised fine-tuning because models learn from first principles by making mistakes and receiving feedback, similar to how humans learn through trial and error rather than pure imitation of expert demonstrations.
- ✓IMO Gold Architecture Decision: DeepMind made the bold choice to completely abandon AlphaProof's specialized symbolic system and use end-to-end Gemini models for IMO 2025. The training process took approximately one week, with four co-captains across London, Mountain View, and Singapore coordinating in different time zones. The live competition format, with problems released on different days, created more adrenaline than standard benchmark optimization, since the gold medal threshold depends on human participant performance.
- ✓Reasoning as Post-Training RL: The technical definition of reasoning in modern LLMs centers on using reinforcement learning during post-training to elicit better thinking capabilities. This involves training models to improve with extended thinking time, whether through discrete token chain-of-thought or latent space representations. The field has moved beyond pure architecture innovation toward RL as the primary modeling toolset for capability improvements at the frontier.
- ✓Data Efficiency Gap: Humans demonstrate eight orders of magnitude better data efficiency than current models: a two-year-old child shows more capability than LLMs after seeing vastly less data. The solution likely involves spending more compute per token during training rather than just scaling dataset size. This represents a fundamental research direction as the field approaches data constraints, with the bug potentially residing in backpropagation, architecture, or learning algorithms.
- ✓Transformer Architecture Persistence: Self-attention will likely remain central to AGI systems despite eight years since the original paper. Attempts to remove or simplify attention consistently fail unless at least one attention layer remains. The architecture serves as the interface between learning algorithms and tokens, with any paradigm shift requiring changes to the entire learning system including backpropagation, not just architectural modifications.
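The on-policy loop described in the first takeaway — generate your own output, get rewarded on that generation, update on your own trajectory — can be sketched as a minimal REINFORCE-style loop on a toy two-armed bandit. Everything here (the bandit rewards, the centered reward signal, the learning rate) is illustrative and not DeepMind's actual setup.

```python
import numpy as np

# Minimal on-policy loop (REINFORCE-style) on a toy two-armed bandit.
# The policy samples its OWN actions, is rewarded on those samples, and
# updates on those same trajectories -- the defining property of
# on-policy RL, as opposed to imitating someone else's successful paths.
# All specifics are illustrative, not any lab's production recipe.

rng = np.random.default_rng(0)
logits = np.zeros(2)                 # policy parameters over two actions
success_prob = np.array([0.2, 0.8])  # arm 1 is genuinely better
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)                    # generate own output
    success = rng.random() < success_prob[a]      # reward that generation
    r = 1.0 if success else -1.0                  # reward centered at 0 (crude baseline)
    grad = -probs                                 # d log pi(a) / d logits
    grad[a] += 1.0                                #   = one_hot(a) - probs
    logits += lr * r * grad                       # train on own trajectory

print(softmax(logits))  # policy should now strongly prefer arm 1
```

Mistakes (reward −1) push probability away from the sampled action, successes pull it in; the policy learns from its own trial and error rather than from demonstrations, which is the contrast with supervised fine-tuning the takeaway draws.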
What It Covers
Yi Tay, co-captain of Google DeepMind's IMO Gold effort and leader of the Singapore Gemini team, discusses the decision to abandon AlphaProof's symbolic system for end-to-end Gemini models, on-policy reinforcement learning philosophy, the one-week training process for IMO models, and why data efficiency and world models represent the next frontier in AI research beyond pure scaling.
Key Questions Answered
- •AI Coding Productivity Shift: AI coding tools crossed an emergent threshold in 2024 where researchers can paste bugs directly into systems like Anthropic's Claude without examining them, receive fixes, and relaunch jobs with high success rates. This represents "vibe training" beyond basic code generation - the model investigates and solves problems the researcher doesn't fully understand. Time saved per bug can reach full workdays, creating passive productivity buffs across entire teams.
- •Geographic Research Strategy: Singapore's advantage for frontier AI research lies in being far enough from Bay Area culture for mental space while maintaining connectivity. The timezone enables 24-hour job coverage when coordinating with London and Mountain View teams. Talent density matters more than team size, with hiring focused on exceptional RL research track records or competitive programming achievements rather than rapid scaling.
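The self-attention layer that the Transformer-persistence takeaway above keeps returning to reduces to a few lines in its generic textbook form, attention(Q, K, V) = softmax(QKᵀ/√d)V. This is a single-head sketch in NumPy with made-up shapes, not Gemini's (or any production model's) implementation.

```python
import numpy as np

# Single-head self-attention in its generic textbook form:
#   attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
# A self-contained sketch with arbitrary toy dimensions.

def self_attention(x, wq, wk, wv):
    d = wq.shape[1]
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                   # token-token affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v                              # mix value vectors per token

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.standard_normal((seq_len, d_model))
wq, wk, wv = (0.1 * rng.standard_normal((d_model, d_head)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (4, 8)
```

Every token attends to every other token, which is why attention serves as the interface between the learning algorithm and the tokens, and why, per the takeaway, simplified variants tend to fail unless at least one such layer survives.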
Notable Moment
Tay reveals he knew nothing about the IMO competition itself and only trained the model checkpoint used for the live event. Team members flew to Australia to receive problems as they were released, running inference in real time while others gathered in London for hackathon-style coordination. The gold medal threshold depended on human participant scores, creating uncertainty until final verification by the IMO committee.