Latent Space

Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay

92 min episode · 3 min read

AI-Generated Summary

Key Takeaways

  • On-Policy vs Off-Policy RL: On-policy learning means the model generates its own outputs, receives rewards for those generations, and then trains on its own trajectories rather than imitating someone else's successful paths (see the minimal sketch after this list). This approach generalizes better than supervised fine-tuning because the model learns from first principles, making mistakes and receiving feedback, much as humans learn through trial and error rather than pure imitation of expert demonstrations.
  • IMO Gold Architecture Decision: DeepMind made the bold choice to abandon AlphaProof's specialized symbolic system entirely and use end-to-end Gemini models for IMO 2025. Training took approximately one week, with four co-captains coordinating across time zones in London, Mountain View, and Singapore. The live competition format, with problems released on different days, created more adrenaline than standard benchmark optimization because the gold-medal threshold depends on how human participants score.
  • Reasoning as Post-Training RL: The technical definition of reasoning in modern LLMs centers on using reinforcement learning during post-training to elicit better thinking capabilities. This involves training models to improve with extended thinking time, whether through discrete token chain-of-thought or latent space representations. The field has moved beyond pure architecture innovation toward RL as the primary modeling toolset for capability improvements at the frontier.
  • Data Efficiency Gap: Humans are roughly eight orders of magnitude more data-efficient than current models: a two-year-old child, having seen vastly less data, is in some ways more capable than an LLM. The fix likely involves spending more compute per token during training rather than just scaling dataset size. This is a fundamental research direction as the field approaches data constraints, with the "bug" potentially residing in backpropagation, the architecture, or the learning algorithm.
  • Transformer Architecture Persistence: Self-attention will likely remain central to AGI systems, eight years after the original paper. Attempts to remove or simplify attention consistently fail unless at least one attention layer remains (see the bare-bones attention sketch after this list). The architecture serves as the interface between the learning algorithm and the tokens, so any paradigm shift would require changing the entire learning system, including backpropagation, not just the architecture.
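
To ground the on-policy takeaway, here is a minimal sketch: a toy REINFORCE loop in PyTorch where the entire "model" is one categorical distribution over answers and the reward simply checks a verifiable target. The task, reward, and hyperparameters are illustrative assumptions, not anything from Gemini's actual training.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, TARGET = 10, 7  # TARGET stands in for "the verifiably correct answer"

# A deliberately tiny "policy": one categorical distribution over answers.
logits = torch.zeros(VOCAB, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    # On-policy: sample trajectories from the *current* model...
    probs = F.softmax(logits, dim=-1)
    samples = torch.multinomial(probs, num_samples=32, replacement=True)
    rewards = (samples == TARGET).float()  # outcome reward: right or wrong
    # ...then train on the model's own generations, weighted by reward
    # (REINFORCE). SFT would instead imitate an expert's answers regardless
    # of what the model itself would have produced.
    log_probs = F.log_softmax(logits, dim=-1)[samples]
    loss = -(rewards * log_probs).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("P(correct) after training:", F.softmax(logits, dim=-1)[TARGET].item())
```

The point of the toy is the data source: the training distribution is the model's own output distribution, which is exactly what "on-policy" means.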

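Likewise, for the attention takeaway, a bare-bones single-head version of scaled dot-product self-attention; shapes and initialization here are illustrative:

```python
import torch

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    q, k, v = x @ wq, x @ wk, x @ wv            # project tokens to Q, K, V
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)     # each token attends to every token
    return weights @ v                          # weighted mix of value vectors

seq_len, d = 5, 16                              # 5 tokens, model width 16
x = torch.randn(seq_len, d)
wq, wk, wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)      # torch.Size([5, 16])
```
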
What It Covers

Yi Tay, co-captain of Google DeepMind's IMO Gold effort and leader of the Singapore Gemini team, discusses the decision to abandon AlphaProof's symbolic system for end-to-end Gemini models, on-policy reinforcement learning philosophy, the one-week training process for IMO models, and why data efficiency and world models represent the next frontier in AI research beyond pure scaling.

Key Questions Answered

  • AI Coding Productivity Shift: AI coding tools crossed an emergent threshold in 2024: researchers can now paste bugs directly into systems like Anthropic's Claude without examining them, receive fixes, and relaunch jobs with high success rates (see the hedged sketch after this list). This goes beyond basic code generation toward "vibe training": the model investigates and solves problems the researcher doesn't fully understand. Time saved per bug can reach a full workday, a passive productivity buff across entire teams.
  • Geographic Research Strategy: Singapore's advantage for frontier AI research lies in being far enough from Bay Area culture for mental space while staying connected. The time zone enables 24-hour job coverage when coordinating with the London and Mountain View teams. Talent density matters more than team size; hiring focuses on exceptional RL research track records or competitive programming achievements rather than rapid scaling.
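
As a hedged illustration of that bug-pasting workflow, here is a sketch using Anthropic's Python SDK. The model id, prompt framing, and helper function are assumptions for the sketch; the episode describes the workflow, not this specific code.

```python
# Sketch of the "paste the bug without reading it" loop, assuming the
# Anthropic Python SDK (pip install anthropic) and ANTHROPIC_API_KEY set.
import anthropic

client = anthropic.Anthropic()

def suggest_fix(traceback_text: str, source_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; use any current one
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "This training job crashed. Diagnose the bug and propose a patch.\n\n"
                f"Traceback:\n{traceback_text}\n\nRelevant code:\n{source_snippet}"
            ),
        }],
    )
    return response.content[0].text

# Usage: pipe the failure straight in, apply the patch, relaunch the job.
# print(suggest_fix(open("crash.log").read(), open("train.py").read()))
```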

Notable Moment

Tay reveals he knew nothing about the IMO competition itself; he only trained the model checkpoint used for the live event. Team members flew to Australia to receive the problems as they were released, running inference in real time, while others gathered in London for hackathon-style coordination. The gold-medal threshold depended on human participants' scores, leaving the result uncertain until final verification by the IMO committee.
