Latent Space

Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay

92 min episode · 3 min read

AI-Generated Summary

Key Takeaways

  • On-Policy vs Off-Policy RL: On-policy learning means the model generates its own outputs, receives rewards for those generations, and then trains on its own trajectories rather than imitating someone else's successful paths (see the minimal sketch after this list). This approach generalizes better than supervised fine-tuning because the model learns from first principles, making mistakes and receiving feedback, much as humans learn through trial and error rather than pure imitation of expert demonstrations.
  • IMO Gold Architecture Decision: DeepMind made the bold choice to abandon AlphaProof's specialized symbolic system entirely and use end-to-end Gemini models for IMO 2025. Training took approximately one week, with four co-captains coordinating across time zones in London, Mountain View, and Singapore. The live competition format, with problems released on different days, created more adrenaline than standard benchmark optimization because the gold-medal threshold depends on how human participants score.
  • Reasoning as Post-Training RL: The technical definition of reasoning in modern LLMs centers on using reinforcement learning during post-training to elicit better thinking capabilities. This involves training models to improve with extended thinking time, whether through discrete token chain-of-thought or latent space representations. The field has moved beyond pure architecture innovation toward RL as the primary modeling toolset for capability improvements at the frontier.
  • Data Efficiency Gap: Humans are roughly eight orders of magnitude more data-efficient than current models: a two-year-old child, having seen vastly less data, is in some ways more capable than an LLM. The fix likely involves spending more compute per token during training rather than just scaling dataset size. This is a fundamental research direction as the field approaches data constraints, with the "bug" potentially residing in backpropagation, the architecture, or the learning algorithm.
  • Transformer Architecture Persistence: Self-attention will likely remain central to AGI systems, eight years after the original paper. Attempts to remove or simplify attention consistently fail unless at least one attention layer remains (see the bare-bones attention sketch after this list). The architecture serves as the interface between the learning algorithm and the tokens, so any paradigm shift would require changing the entire learning system, including backpropagation, not just the architecture.
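
To ground the on-policy takeaway, here is a minimal sketch: a toy REINFORCE loop in PyTorch where the entire "model" is one categorical distribution over answers and the reward simply checks a verifiable target. The task, reward, and hyperparameters are illustrative assumptions, not anything from Gemini's actual training.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, TARGET = 10, 7  # TARGET stands in for "the verifiably correct answer"

# A deliberately tiny "policy": one categorical distribution over answers.
logits = torch.zeros(VOCAB, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    # On-policy: sample trajectories from the *current* model...
    probs = F.softmax(logits, dim=-1)
    samples = torch.multinomial(probs, num_samples=32, replacement=True)
    rewards = (samples == TARGET).float()  # outcome reward: right or wrong
    # ...then train on the model's own generations, weighted by reward
    # (REINFORCE). SFT would instead imitate an expert's answers regardless
    # of what the model itself would have produced.
    log_probs = F.log_softmax(logits, dim=-1)[samples]
    loss = -(rewards * log_probs).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("P(correct) after training:", F.softmax(logits, dim=-1)[TARGET].item())
```

The point of the toy is the data source: the training distribution is the model's own output distribution, which is exactly what "on-policy" means.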

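Likewise, for the attention takeaway, a bare-bones single-head version of scaled dot-product self-attention; shapes and initialization here are illustrative:

```python
import torch

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    q, k, v = x @ wq, x @ wk, x @ wv            # project tokens to Q, K, V
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)     # each token attends to every token
    return weights @ v                          # weighted mix of value vectors

seq_len, d = 5, 16                              # 5 tokens, model width 16
x = torch.randn(seq_len, d)
wq, wk, wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)      # torch.Size([5, 16])
```
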
What It Covers

Yi Tay, co-captain of Google DeepMind's IMO Gold effort and leader of the Singapore Gemini team, discusses the decision to abandon AlphaProof's symbolic system for end-to-end Gemini models, on-policy reinforcement learning philosophy, the one-week training process for IMO models, and why data efficiency and world models represent the next frontier in AI research beyond pure scaling.

Key Questions Answered

  • AI Coding Productivity Shift: AI coding tools crossed an emergent threshold in 2024: researchers can now paste bugs directly into systems like Anthropic's Claude without examining them, receive fixes, and relaunch jobs with high success rates (see the hedged sketch after this list). This goes beyond basic code generation toward "vibe training": the model investigates and solves problems the researcher doesn't fully understand. Time saved per bug can reach a full workday, a passive productivity buff across entire teams.
  • Geographic Research Strategy: Singapore's advantage for frontier AI research lies in being far enough from Bay Area culture for mental space while staying connected. The time zone enables 24-hour job coverage when coordinating with the London and Mountain View teams. Talent density matters more than team size; hiring focuses on exceptional RL research track records or competitive programming achievements rather than rapid scaling.
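
As a hedged illustration of that bug-pasting workflow, here is a sketch using Anthropic's Python SDK. The model id, prompt framing, and helper function are assumptions for the sketch; the episode describes the workflow, not this specific code.

```python
# Sketch of the "paste the bug without reading it" loop, assuming the
# Anthropic Python SDK (pip install anthropic) and ANTHROPIC_API_KEY set.
import anthropic

client = anthropic.Anthropic()

def suggest_fix(traceback_text: str, source_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; use any current one
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "This training job crashed. Diagnose the bug and propose a patch.\n\n"
                f"Traceback:\n{traceback_text}\n\nRelevant code:\n{source_snippet}"
            ),
        }],
    )
    return response.content[0].text

# Usage: pipe the failure straight in, apply the patch, relaunch the job.
# print(suggest_fix(open("crash.log").read(), open("train.py").read()))
```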

Notable Moment

Tay reveals he knew nothing about the IMO competition itself; he only trained the model checkpoint used for the live event. Team members flew to Australia to receive the problems as they were released, running inference in real time, while others gathered in London for hackathon-style coordination. The gold-medal threshold depended on human participants' scores, leaving the result uncertain until final verification by the IMO committee.
