
Yi Tay

2 episodes
1 podcast

We have 2 summarized appearances for Yi Tay so far. Browse all podcasts to discover more episodes.

Featured On 1 Podcast

All Appearances

2 episodes

AI Summary

→ WHAT IT COVERS

Yi Tay, co-captain of Google DeepMind's IMO Gold effort and leader of the Singapore Gemini team, discusses the decision to abandon AlphaProof's symbolic system in favor of end-to-end Gemini models, the philosophy behind on-policy reinforcement learning, the one-week training process for the IMO models, and why data efficiency and world models represent the next frontier in AI research beyond pure scaling.

→ KEY INSIGHTS

- **On-Policy vs Off-Policy RL:** On-policy learning means models generate their own outputs, receive rewards for those generations, and then train on their own trajectories rather than imitating others' successful paths. This approach proves more generalizable than supervised fine-tuning because models learn from first principles by making mistakes and receiving feedback, much as humans learn through trial and error rather than pure imitation of expert demonstrations. (A minimal code sketch of the contrast follows this summary.)
- **IMO Gold Architecture Decision:** DeepMind made the bold choice to abandon AlphaProof's specialized symbolic system entirely and use end-to-end Gemini models for IMO 2025. Training took approximately one week, with four co-captains coordinating across London, Mountain View, and Singapore time zones. The live competition format, with problems released on different days, created more adrenaline than standard benchmark optimization, since the gold medal threshold depends on human participant performance.
- **Reasoning as Post-Training RL:** In modern LLMs, reasoning technically means using reinforcement learning during post-training to elicit better thinking capabilities: training models to improve with extended thinking time, whether through discrete-token chain-of-thought or latent-space representations. The field has moved beyond pure architecture innovation toward RL as the primary modeling toolset for capability improvements at the frontier.
- **Data Efficiency Gap:** Humans are roughly eight orders of magnitude more data-efficient than current models; a two-year-old child shows more capability than LLMs despite seeing vastly less data. The solution likely involves spending more compute per token during training rather than just scaling the dataset (see the arithmetic sketch after this summary). This is a fundamental research direction as the field approaches data constraints, with the bug potentially residing in backpropagation, the architecture, or the learning algorithm.
- **Transformer Architecture Persistence:** Self-attention will likely remain central to AGI systems, even eight years after the original paper. Attempts to remove or simplify attention consistently fail unless at least one attention layer remains (a minimal attention sketch also follows this summary). The architecture serves as the interface between the learning algorithm and the tokens, so any paradigm shift would require changing the entire learning system, including backpropagation, not just the architecture.
- **AI Coding Productivity Shift:** AI coding tools crossed an emergent threshold in 2024: researchers can paste bugs directly into systems like Anthropic's Claude without examining them, receive fixes, and relaunch jobs with high success rates. This is "vibe training" beyond basic code generation; the model investigates and solves problems the researcher doesn't fully understand. Time saved per bug can reach full workdays, creating a passive productivity buff across entire teams.
- **Geographic Research Strategy:** Singapore's advantage for frontier AI research lies in being far enough from Bay Area culture for mental space while remaining well connected. The time zone enables 24-hour job coverage when coordinating with the London and Mountain View teams. Talent density matters more than team size, with hiring focused on exceptional RL research track records or competitive programming achievements rather than rapid scaling.

→ NOTABLE MOMENT

Tay reveals he knew nothing about the IMO competition itself and only trained the model checkpoint used for the live event. Team members flew to Australia to receive the problems as they were released, running inference in real time while others gathered in London for hackathon-style coordination. The gold medal threshold depended on human participant scores, creating uncertainty until final verification by the IMO committee.

💼 SPONSORS

None detected

🏷️ Reinforcement Learning, IMO Competition, Gemini Models, On-Policy Training, Data Efficiency, Transformer Architecture, AI Research Singapore
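To make the on-policy/off-policy contrast from the first insight concrete, here is a minimal sketch in NumPy. It is a toy bandit, not anything from Gemini's training stack; the reward table, learning rate, and expert action are all invented for illustration. The on-policy learner samples from its own policy and reinforces by reward; the off-policy learner does supervised imitation of a fixed expert.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 4
REWARDS = np.array([0.1, 0.2, 0.9, 0.3])   # hidden per-action reward (toy verifier)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# On-policy (REINFORCE-style): sample from our own policy, reinforce by reward.
logits_on = np.zeros(N_ACTIONS)
for _ in range(2000):
    p = softmax(logits_on)
    a = rng.choice(N_ACTIONS, p=p)       # the model generates its own output
    r = REWARDS[a] + rng.normal(0, 0.1)  # the environment scores that output
    grad = -p
    grad[a] += 1.0                       # d log p(a) / d logits for a softmax
    logits_on += 0.1 * r * grad          # train on our own trajectory

# Off-policy (SFT-style imitation): cross-entropy toward an expert's choice.
expert_action = 1                        # the demonstrator's (suboptimal) pick
logits_off = np.zeros(N_ACTIONS)
for _ in range(2000):
    p = softmax(logits_off)
    grad = -p
    grad[expert_action] += 1.0
    logits_off += 0.1 * grad             # no reward signal, pure imitation

print("on-policy: ", np.round(softmax(logits_on), 2))   # mass moves to the high-reward action
print("off-policy:", np.round(softmax(logits_off), 2))  # mass copies the expert's action
```

Run as-is, the on-policy distribution concentrates on the highest-reward action, while the imitator can only converge to the expert's choice, mirroring the episode's point that imitation caps you at the demonstrator's level.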
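The "more compute per token" idea in the data-efficiency insight can be grounded with the common rule of thumb that training costs roughly 6 FLOPs per parameter per token (C ≈ 6·N·D). The sketch below uses entirely hypothetical budget, model, and dataset sizes to show the two ways of spending a fixed budget: more tokens, or more FLOPs spent on each token.

```python
# Rough training-compute arithmetic with the common C ≈ 6·N·D approximation
# (C = training FLOPs, N = parameters, D = tokens seen). All sizes below are
# hypothetical, chosen only to illustrate the trade-off.

BUDGET = 1e24  # fixed FLOP budget

# Option 1: scale data. Fix the model, let the budget buy more tokens.
n1 = 7e9                       # a 7B-parameter model
d1 = BUDGET / (6 * n1)         # tokens the budget affords
print(f"scale data:  N={n1:.1e}, D={d1:.1e}, FLOPs/token={BUDGET / d1:.1e}")

# Option 2: scale compute per token. Fix the (finite) dataset and grow the
# model, so every token gets more FLOPs spent on it.
d2 = 2e12                      # a fixed 2T-token dataset
n2 = BUDGET / (6 * d2)         # model size the budget affords at one epoch
print(f"scale model: N={n2:.1e}, D={d2:.1e}, FLOPs/token={BUDGET / d2:.1e}")
```

Once D is capped by the available data, the only lever left under this approximation is the FLOPs spent per token, whether via a larger model, more epochs, or richer per-token computation.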
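For the attention insight, here is a single-head self-attention layer in plain NumPy, the operation the episode argues keeps surviving simplification attempts. Shapes and weights are random placeholders; real models add multiple heads, masking, and output projections.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 5, 8                            # sequence length, model width
x = rng.normal(size=(T, D))            # token representations
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention: each token gathers information from
    every other token, weighted by query-key similarity."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (T, T) pairwise affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # row-wise softmax
    return w @ v                               # weighted sum of value vectors

out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8): same shape in and out, so it slots into any stack
```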

AI Summary

→ WHAT IT COVERS

Yi Tay, co-captain of Google DeepMind's IMO Gold achievement, discusses the decision to abandon AlphaProof's specialized system for an end-to-end Gemini model, the shift from off-policy to on-policy reinforcement learning, training the model in one week with four co-captains across three time zones, and establishing DeepMind's Reasoning and AGI team in Singapore, while reflecting on architecture debates and data efficiency challenges.

→ KEY INSIGHTS

- **On-Policy vs Off-Policy RL:** On-policy learning means models generate their own outputs and train on those trajectories using reward signals, much as humans make mistakes and learn from feedback. Off-policy learning mimics other models' successful trajectories through supervised fine-tuning. On-policy proves more generalizable because models learn from first principles rather than copying, though the science comparing the two approaches remains incomplete. This philosophical shift drives modern LLM reinforcement learning.
- **IMO Gold Training Process:** Four co-captains across London, Mountain View, and Singapore trained the IMO model in approximately one week, passing work between shifts across time zones. The team abandoned AlphaProof's specialized symbolic system for a pure end-to-end Gemini model, betting that general-purpose models with sufficient parameters can subsume specialized tools. The live competition format meant no benchmark climbing: teams received problems in real time and ran inference immediately, creating genuine uncertainty about achieving gold.
- **Learning Rate Adjustment Strategy:** When one counterexample breaks a worldview built on ten years of consistent experience, update your beliefs by 20-50%, not 2%. Most people update too slowly when proven wrong, staying overly Bayesian when they should recognize that their entire prior model is invalid. This applies to both human learning and AI research: when Stable Diffusion emerged, it required a complete revision of mental models, not an incremental adjustment. (A toy illustration of the heuristic follows this summary.)
- **Reasoning as RL Post-Training:** Reasoning now means using reinforcement learning in post-training to elicit better thinking capabilities from models. It encompasses chain-of-thought tokens, thinking traces, and trajectory optimization. The technical definition: making models better at thinking through RL, whether via discrete-token thinking or latent thinking in vector space. This shifts reasoning from something mysterious to a capability improved systematically through specific training methods.
- **Data Efficiency Bottleneck:** Humans achieve vastly higher data efficiency than models; a two-year-old recognizes dogs from three examples while models need thousands. The solution involves spending more FLOPs per token to extract maximum learning from limited data, not just adding more data. With internet data becoming finite, research focuses on learning algorithms and architectures that squeeze more capability from each training token, potentially through world-model approaches or improved backpropagation methods.
- **AI Coding Productivity Gains:** Models now fix bugs without human inspection: paste error messages into tools like Anthropic's Claude, apply the suggested fixes, and relaunch jobs. This surpasses basic "vibe coding," where you already know the solution but want automation; the model investigates and solves problems faster than manual debugging would, sometimes saving full workdays. The shift moves AI from occasional helper to automatic debugging partner for high-expertise ML work.
- **Geographic Distribution Strategy:** DeepMind Singapore operates with 24-hour coverage across the London, Mountain View, and Singapore time zones, enabling continuous model training and debugging. Geography matters for talent density and cultural diversity; some researchers refuse Bay Area relocation but will join the Singapore or London teams. The region is "far enough for peace and quiet, close enough to stay connected," with access to strong local talent pools attracted by frontier AGI research opportunities.

→ NOTABLE MOMENT

Tay reveals that when the IMO model checkpoint was delivered for the live competition, he knew nothing about IMO mathematics and couldn't solve the problems himself. He watched the human participants' scores to determine whether Gemini would achieve gold, since the medal threshold depends on a bell curve of human performance. The outcome remained genuinely uncertain until the results arrived, creating real adrenaline unlike standard benchmark climbing.

💼 SPONSORS

None detected

🏷️ Reinforcement Learning, IMO Mathematics, Model Architecture, DeepMind Singapore, On-Policy Training, AI Coding Tools, Data Efficiency
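The belief-update heuristic ("shift 20-50%, not 2%") can be written as a one-line rule. This is only a toy numerical illustration of the idea as stated in the episode; the belief values and step sizes are invented.

```python
def update(belief: float, observed: float, step: float) -> float:
    """Move `belief` toward `observed` by fraction `step` of the gap."""
    return belief + step * (observed - belief)

belief = 0.95          # e.g. confidence that "approach X cannot work"
counterexample = 0.0   # one result that flatly contradicts it

# Timid, overly-Bayesian update: the belief barely moves.
print(update(belief, counterexample, 0.02))
# The episode's heuristic: a 20-50% step forces a real revision.
print(update(belief, counterexample, 0.35))
```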
