
[State of RL/Reasoning] IMO/IOI Gold, OpenAI o3/GPT-5, and Cursor Composer — Ashvin Nair, Cursor
Latent Space AI Summary
→ WHAT IT COVERS

Ashvin Nair, a former OpenAI reasoning team member now at Cursor, discusses his transition from robotics to language models, the development of OpenAI's o1/o3 reasoning models with a 300-person team, achieving IMO/IOI gold medals, and Cursor's approach to co-designing products with models through rapid RL iteration cycles every two hours.

→ KEY INSIGHTS

- **Reasoning Model Development Scale:** OpenAI's o1 reasoning model started with approximately 12 core people but expanded to 50-100 contributors for the initial release and eventually 300 people for o3. The breakthrough came in 2023, when RL applied to smaller pretrained models produced surprisingly accurate reasoning traces on math problems, demonstrating capabilities unachievable through additional pretraining alone and prompting full-scale investment.
- **RL Generalization Limitations:** Reinforcement learning for language models excels within its training distribution but generalizes poorly beyond it. The solution is to bring economically useful tasks into the training distribution rather than expecting broad generalization. This means products must capture complete user context, including code repositories, terminal access, conversation history, and workflow data, to enable effective RL training on real-world tasks.
- **Robotics Market Timing:** Language model agents represent a trillion-dollar market opportunity before robotics reaches even ten billion dollars in value. Current AI robotics sits at the GPT-1 to GPT-2 stage of development, showing hints of generalization but lacking reliable out-of-distribution performance. The technology must demonstrate value creation before the unit economics can work, including maintenance costs and reliability thresholds for commercial deployment.
- **Continual Learning Gap:** Models trained on trillions of tokens should theoretically handle millions of deployment tokens without capacity constraints, yet they repeatedly make identical mistakes within and across contexts. The field needs breakthroughs in continual learning that let models permanently learn from single experiences, the way humans avoid hot stoves after one touch, rather than requiring explicit data curation and filtering.
- **Product-Model Co-Design:** Cursor's 20-25 person ML team ships competitive models by tightly integrating product and model development. Its Composer model balances intelligence with speed to keep programmers in a flow state, avoiding the context switches caused by slow inference. Internal tooling enables SSH sessions into user environments for direct data inspection, and policy updates ship every two hours, a cadence impossible at larger organizations with separate product and research teams.

→ NOTABLE MOMENT

Nair reveals that when he attended The Curve conference before o1's release, attendees predicted 20% performance on math benchmarks by 2027, yet OpenAI already had models internally that exceeded those estimates. The same forecasters predicting Dyson spheres by 2035 were simultaneously underestimating near-term capabilities by multiple years, demonstrating systematic miscalibration in AI progress predictions.

💼 SPONSORS

None detected

🏷️ Reinforcement Learning, AI Reasoning Models, Code Generation, Continual Learning, Product-Model Co-Design