
Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay
Latent Space AI Summary
→ WHAT IT COVERS

Yi Tay, co-captain of Google DeepMind's IMO Gold effort and leader of the Singapore Gemini team, discusses the decision to abandon AlphaProof's symbolic system in favor of end-to-end Gemini models, his on-policy reinforcement learning philosophy, the one-week training process behind the IMO models, and why data efficiency and world models, not pure scaling, represent the next frontier in AI research.

→ KEY INSIGHTS

- **On-Policy vs. Off-Policy RL:** In on-policy learning, a model generates its own outputs, receives rewards on those generations, and then trains on its own trajectories rather than imitating someone else's successful paths. This generalizes better than supervised fine-tuning because the model learns from first principles, making mistakes and receiving feedback, much as humans learn through trial and error rather than pure imitation of expert demonstrations.

- **IMO Gold Architecture Decision:** DeepMind made the bold choice to abandon AlphaProof's specialized symbolic system entirely and use end-to-end Gemini models for IMO 2025. Training took approximately one week, with four co-captains coordinating across London, Mountain View, and Singapore time zones. The live competition format, with problems released on different days, created more adrenaline than standard benchmark optimization, since the gold medal threshold depends on human participants' performance.

- **Reasoning as Post-Training RL:** The working definition of reasoning in modern LLMs centers on using reinforcement learning during post-training to elicit better thinking. This means training models to improve with extended thinking time, whether through discrete-token chain-of-thought or latent-space representations. The field has moved beyond pure architecture innovation toward RL as the primary modeling toolset for capability gains at the frontier.
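The on-policy loop described above — generate, get rewarded, train on your own trajectories — can be sketched as a minimal REINFORCE update on a toy bandit. Everything here (the three-armed bandit, the rewards, the hyperparameters) is illustrative, not anything from the actual Gemini setup; the point is only that the gradient is computed on actions the policy itself sampled:

```python
import numpy as np

# Illustrative 3-armed bandit; rewards are made up for this sketch.
REWARDS = np.array([0.1, 0.5, 1.0])
rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train_on_policy(steps=2000, lr=0.1):
    """REINFORCE: the policy samples its OWN action, is rewarded for it,
    and updates on that self-generated trajectory (no expert demonstrations)."""
    logits = np.zeros(3)
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choice(3, p=probs)        # model generates its own output
        r = REWARDS[a]                    # reward for that generation
        grad = -probs                     # d/d_logits of log pi(a) ...
        grad[a] += 1.0                    # ... is (one_hot(a) - probs)
        logits += lr * r * grad           # policy-gradient ascent step
    return softmax(logits)

probs = train_on_policy()
print(probs)  # probability mass concentrates on the highest-reward arm
```

In expectation each logit moves by `lr * p_i * (r_i - E[r])`, so above-average-reward actions are reinforced and the rest decay — learning by trial and error on one's own samples, the contrast with supervised fine-tuning the bullet draws.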
- **Data Efficiency Gap:** Humans are roughly eight orders of magnitude more data-efficient than current models; a two-year-old child shows more capability than an LLM after seeing vastly less data. The solution likely involves spending more compute per token during training rather than just scaling dataset size. This is a fundamental research direction as the field approaches data constraints, with the bug potentially residing in backpropagation, the architecture, or the learning algorithm.

- **Transformer Architecture Persistence:** Self-attention will likely remain central to AGI systems, eight years after the original paper. Attempts to remove or simplify attention consistently fail unless at least one attention layer remains. The architecture serves as the interface between the learning algorithm and the tokens; any real paradigm shift would require changing the entire learning system, including backpropagation, not just the architecture.

- **AI Coding Productivity Shift:** AI coding tools crossed an emergent threshold in 2024: researchers can paste bugs directly into systems like Anthropic's Claude without examining them, receive fixes, and relaunch jobs with high success rates. This is "vibe training" beyond basic code generation — the model investigates and solves problems the researcher doesn't fully understand. Time saved per bug can reach full workdays, creating a passive productivity buff across entire teams.

- **Geographic Research Strategy:** Singapore's advantage for frontier AI research lies in being far enough from Bay Area culture for mental space while staying well connected. Its timezone enables 24-hour job coverage when coordinating with the London and Mountain View teams. Talent density matters more than team size, with hiring focused on exceptional RL research track records or competitive programming achievements rather than rapid headcount growth.
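Given how often attention survives the simplification attempts mentioned above, it is worth recalling how small the surviving core is. A minimal single-head scaled dot-product attention in NumPy (dimensions and inputs are illustrative; real models add projections, masking, and multiple heads):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V — each token mixes information
    from every other token, weighted by query-key similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (T, T) token-token affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                            # (T, d) mixed representations

rng = np.random.default_rng(0)
T, d = 4, 8                                       # 4 tokens, 8-dim vectors
x = rng.standard_normal((T, d))
out = attention(x, x, x)                          # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```

This token-mixing step is the "interface between learning algorithms and tokens" the bullet describes: every other component (MLPs, normalization, even the attention variant) has been swapped out in ablations, but removing the last such mixing layer is what consistently fails.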
→ NOTABLE MOMENT

Tay reveals he knew nothing about the IMO competition itself and only trained the model checkpoint used for the live event. Team members flew to Australia to receive the problems as they were released, running inference in real time, while others gathered in London for hackathon-style coordination. The gold medal threshold depended on human participants' scores, creating uncertainty until final verification by the IMO committee.

💼 SPONSORS
None detected

🏷️ Reinforcement Learning, IMO Competition, Gemini Models, On-Policy Training, Data Efficiency, Transformer Architecture, AI Research Singapore