Sergey Levine

#331 Sergey Levine: The Robot Revolution Nobody Is Talking About

Apr 12, 2026 · 59 min · Founder of Physical Intelligence, UC Berkeley Professor

AI Summary

→ WHAT IT COVERS
Sergey Levine, co-founder of Physical Intelligence and UC Berkeley professor, explains how robotic foundation models work, why diverse real-world data outperforms simulation, how Vision-Language-Action models enable generalist robots, and what the path toward autonomous continual learning systems looks like over the next several years.

→ KEY INSIGHTS
- **Cross-Embodiment Data Transfer:** Training robots on data from multiple platforms dramatically improves performance on new hardware. Physical Intelligence trained mobile robots on datasets in which only 3% of the data came from mobile platforms and the remaining 97% from static arms, yet the robots navigated unseen home environments and completed kitchen cleanup tasks.
- **RT-X Project Benchmark:** In the 2023 Open X-Embodiment (RT-X) project, a single generalist model trained on data from approximately 30 academic robotics labs outperformed each individual lab's specialized model by roughly 50% on that lab's own tasks. This mirrors the earlier finding in NLP that generalist language models beat specialized models on domain-specific benchmarks.
- **Generalist Models Outperform Specialists in Open Environments:** Even when a robot needs to perform one specific task, a generalist model produces better real-world results than a narrow specialist. Unpredictable variables (misaligned objects, foreign items on surfaces, damaged materials) appear constantly outside controlled settings, and only models trained on diverse scenarios handle these edge cases reliably.
- **Layered Inference Architecture for On-Device Deployment:** The path to reliable on-device robot intelligence involves splitting inference by abstraction level. High-level semantic reasoning runs on cloud servers, while low-level motor control runs locally on smaller, faster models. This architecture degrades gracefully when connectivity drops: the robot falls back on cached inferences and local reflexive responses.
- **Language Feedback as a Scalable Training Signal:** Once a foundation model's low-level motor skills reach sufficient quality, verbal corrections (telling the robot in natural language what it did wrong) can improve the policy without additional teleoperation. This works because language supervises the model's internal reasoning chain rather than raw actions, making it a lower-cost, scalable alternative to full human demonstration data.

→ NOTABLE MOMENT
Levine challenges the assumption that world models and Vision-Language-Action models are fundamentally different approaches. He argues the real goal is a unified system that selects the appropriate level of abstraction (predictive, semantic, or reflexive) depending on the stage of a task, rather than treating these as competing paradigms.

💼 SPONSORS
- Modulate (Velma): https://preview.modulate.ai

🏷️ Robotic Foundation Models, Vision Language Action Models, Sim-to-Real Transfer, Reinforcement Learning, Embodied AI
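
The layered inference idea from the summary above can be made concrete with a small sketch. This is not Physical Intelligence's implementation: `LayeredController`, `CachedPlan`, the `"hold_position"` reflex, and the 5-second cache lifetime are all hypothetical names and values chosen for illustration. It assumes the cloud planner raises `ConnectionError` when the link is down.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CachedPlan:
    subtask: str      # last high-level instruction received from the cloud
    timestamp: float  # time it was received, in seconds


class LayeredController:
    """Hypothetical two-tier controller: remote planner, local motor policy."""

    def __init__(
        self,
        cloud_planner: Callable[[dict], str],
        local_policy: Callable[[dict, str], str],
        max_plan_age_s: float = 5.0,  # assumed cache lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cloud_planner = cloud_planner
        self.local_policy = local_policy
        self.max_plan_age_s = max_plan_age_s
        self.clock = clock
        self.cache: Optional[CachedPlan] = None

    def step(self, observation: dict) -> Tuple[str, str]:
        """Return (where the subtask came from, low-level action)."""
        now = self.clock()
        try:
            subtask = self.cloud_planner(observation)
            self.cache = CachedPlan(subtask, now)
            source = "cloud"
        except ConnectionError:
            fresh = (
                self.cache is not None
                and now - self.cache.timestamp <= self.max_plan_age_s
            )
            if fresh:
                subtask, source = self.cache.subtask, "cache"
            else:
                # No usable plan: fall back to a safe local reflex.
                subtask, source = "hold_position", "reflex"
        # Motor control always runs locally, whatever the subtask source.
        return source, self.local_policy(observation, subtask)


def demo() -> List[Tuple[str, str]]:
    """Simulate a link that drops after the first control tick."""
    state = {"online": True, "now": 0.0}

    def cloud_planner(obs: dict) -> str:
        if not state["online"]:
            raise ConnectionError("link down")
        return "pick_up_plate"

    def local_policy(obs: dict, subtask: str) -> str:
        return f"motor_command_for:{subtask}"

    ctrl = LayeredController(
        cloud_planner, local_policy, clock=lambda: state["now"]
    )
    results = [ctrl.step({})]             # link up: plan comes from the cloud
    state["online"], state["now"] = False, 2.0
    results.append(ctrl.step({}))         # link down, cached plan still fresh
    state["now"] = 10.0
    results.append(ctrl.step({}))         # cached plan expired: reflex only
    return results
```

Calling `demo()` walks through the three regimes in order: cloud plan, cached plan, local reflex. A real system would likely query the planner asynchronously and at a lower rate than the motor loop; the synchronous call here only keeps the sketch short.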


Sergey Levine - Building LLMs for the Physical World - [Invest Like the Best, EP.465]

Invest Like the Best with Patrick O'Shaughnessy

Mar 31, 2026 · 67 min · Cofounder and Researcher at Physical Intelligence

AI Summary

→ WHAT IT COVERS
Sergey Levine, cofounder of Physical Intelligence, explains why building general-purpose robotic foundation models (systems that control any robot for any task) is more tractable than narrow domain-specific approaches, drawing direct parallels to how large language models outcompeted specialized NLP systems by leveraging broad, weakly labeled data at scale.

→ KEY INSIGHTS
- **Generality over specialization:** Building one robotic foundation model that handles all tasks and embodiments outperforms narrow specialists in the long term, mirroring how LLMs displaced domain-specific NLP tools such as dedicated machine translation systems. The key mechanism: broad data enables physical world understanding, which transfers across applications far more efficiently than rebuilding task-specific pipelines from scratch for each new robot deployment.
- **Chain-of-thought unlocks robotic common sense:** Physical Intelligence's models use intermediate semantic reasoning before acting. A robot told to "clean the kitchen" first identifies which object to pick up, then moves. This chain-of-thought step activates web-scale pre-training knowledge to handle edge cases, shifting the bottleneck from low-level motor control to mid-level scene interpretation, which can be supervised with language alone.
- **Coaching replaces teleoperation data:** Six months ago, Physical Intelligence discovered that labeling robot experiences with high-level semantic commands, without adding any new low-level action demonstrations, improved kitchen generalization. Operators can therefore improve robot performance simply by verbally coaching the system, which sharply reduces the cost and complexity of extending a robot's capabilities to new environments.
- **Reinforcement learning enables superhuman throughput:** After a task is demonstrated via teleoperation, robots can practice autonomously and remove human-paced pauses. In cable-plugging tasks, the robot identified and eliminated all hesitation points, executing the task significantly faster than human operators. Reinforcement learning is the general mechanism; simpler speed-optimization tricks also deliver throughput gains without a full RL pipeline.
- **Hardware costs dropped by roughly two orders of magnitude in a decade:** Robot hardware costs fell from roughly $400,000 for a PR2 in 2014 to approximately $3,000–$4,000 per arm today. This cost collapse, enabled by combining cheaper hardware with learning-based control that tolerates mechanical imprecision, makes broad experimentation practical. Traditional industrial control methods required high-precision hardware; foundation model approaches compensate for mechanical variability through learned adaptation.
- **Moravec's Paradox defines the hardest remaining tasks:** Tasks humans perform effortlessly (interpersonal physical assistance, elderly care, infant care) will be the last robotic capabilities achieved, not because of motor complexity but because humans are evolutionarily optimized for them. Robots will handle well-defined but chaotic environments such as hotel rooms or restaurant kitchens before mastering open-ended human-interaction tasks where stakes are high and edge cases are unbounded.

→ NOTABLE MOMENT
Levine describes running the "Robot Olympics," a blogger's list of mundane tasks no robot could do, such as using a plastic bag to pick up dog waste or washing a greasy pan, as an internal stress test of the company's task-onboarding pipeline. The system completed nearly every task without any task-specific development, demonstrating generalization in practice.

💼 SPONSORS
- Ramp: https://ramp.com/invest
- Rogo: https://rogo.ai/invest
- WorkOS: https://workos.com
- Vanta: https://vanta.com/invest
- Ridgeline: https://ridgeline.ai

🏷️ Robotics Foundation Models, Reinforcement Learning, Physical AI, Embodied Intelligence, Robot Data Collection, General-Purpose Automation
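
The chain-of-thought and coaching points in the summary above describe a policy that first produces a language-level subtask and only then produces motion, so a verbal correction can change the first stage without touching the second. The toy below illustrates that separation only. It is not how Physical Intelligence's models work internally: `TwoStagePolicy`, `MOTOR_PRIMITIVES`, and the lookup-table "reasoning" are hypothetical stand-ins for learned components.

```python
from typing import Dict, List, Tuple

# Hypothetical fixed motor skills; coaching never modifies these.
MOTOR_PRIMITIVES: Dict[str, List[str]] = {
    "pick up": ["reach", "grasp", "lift"],
    "wipe": ["reach", "press", "sweep"],
}


class TwoStagePolicy:
    """Toy policy: language-level subtask first, motor primitives second."""

    def __init__(self) -> None:
        # (instruction, object) -> subtask text, filled in by coaching.
        self.subtask_rules: Dict[Tuple[str, str], str] = {}

    def coach(self, instruction: str, obj: str, subtask: str) -> None:
        """Language-only feedback: updates the reasoning stage only."""
        self.subtask_rules[(instruction, obj)] = subtask

    def reason(self, instruction: str, scene_objects: List[str]) -> str:
        """Pick a subtask for the first object that has a known rule."""
        for obj in scene_objects:
            subtask = self.subtask_rules.get((instruction, obj))
            if subtask is not None:
                return subtask
        return "wait"

    def act(self, subtask: str) -> List[str]:
        """Map the subtask text to a fixed sequence of motor primitives."""
        for verb, motions in MOTOR_PRIMITIVES.items():
            if subtask.startswith(verb):
                return motions
        return []

    def run(
        self, instruction: str, scene_objects: List[str]
    ) -> Tuple[str, List[str]]:
        subtask = self.reason(instruction, scene_objects)
        return subtask, self.act(subtask)


policy = TwoStagePolicy()
before = policy.run("clean the kitchen", ["sponge", "plate"])

# A verbal correction supplies a subtask label, with no new demonstrations.
policy.coach("clean the kitchen", "plate", "pick up the plate")
after = policy.run("clean the kitchen", ["sponge", "plate"])
```

Here `before` carries the subtask `"wait"` with no motion, while `after` carries `"pick up the plate"` and reuses the existing `"pick up"` primitives. In a learned system the coaching signal would be training data for the reasoning stage rather than a table entry.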


Featured On 2 Podcasts

Eye on AI

Invest Like the Best with Patrick O'Shaughnessy

Top resources Sergey Levine mentions

Physical Intelligence

