The System Behind Self-Driving: Waymo’s Dmitri Dolgov
Episode: 64 min · Read time: 3 min
AI-Generated Summary
Key Takeaways
- ✓ Sensor Fusion Architecture: Waymo uses three complementary sensing modalities — cameras, LiDAR, and radar — each with 360-degree coverage. Rather than switching between sensors, all three feed separate encoders that jointly produce a unified world model. Radar excels in fog and heavy rain where cameras degrade; LiDAR provides high-resolution 3D structure. No real-time cloud dependency exists; all safety-critical inference runs locally onboard the vehicle.
- ✓ Foundation Model Distillation Pipeline: Waymo builds one large off-board foundation model, then specializes it into three "teacher" models — the Driver, the Simulator, and the Critic. Each teacher distills a smaller, faster "student" model deployable on the vehicle. This architecture enables closed-loop reinforcement learning fine-tuning, realistic synthetic environment generation, and automated behavioral evaluation without requiring pixel-level simulation throughout the entire training pipeline.
- ✓ Full Autonomy vs. Driver Assist — A Qualitative Gap: Dolgov argues that driver-assist systems and full autonomy are fundamentally different engineering problems, not points on a single spectrum. A basic vision-language model fine-tuned on trajectories can handle nominal driving but falls orders of magnitude short of the safety threshold required for driverless operation. Reaching full autonomy requires the Simulator and Critic infrastructure that driver-assist development never demands, making incremental convergence from Level 2 upward practically implausible.
- ✓ Generation 6 Hardware Cost Reduction: Waymo's sixth-generation sensor stack costs a fraction of the fifth generation — comparable to a premium ADAS system — through unification and simplification across all three modalities. The driving software stack transfers largely unchanged across hardware generations and vehicle platforms, including the upcoming Hyundai Ioniq deployment. LiDAR, radar, and camera component costs follow predictable downward trends as automotive supply chains mature and manufacturing volumes increase.
- ✓ Scaling Signals and City Expansion Velocity: Waymo operates 3,000 vehicles across 11 U.S. cities, generating roughly 4 million fully autonomous miles per week. The company began serving riders in four new cities in a single day — a milestone that took eight years to achieve from first autonomous passenger operation in Chandler, Arizona in 2020. London and Tokyo deployments are planned for 2025, with the core technology generalizing well to new geographies with targeted data collection and validation work.
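The "separate encoders, unified world model" idea in the sensor-fusion takeaway can be sketched in a few lines. This is a toy illustration, not Waymo's actual architecture: the encoder shapes, weights, and fusion step are all invented for the example.

```python
import numpy as np

def encode(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Toy per-modality encoder: a linear projection plus ReLU."""
    return np.maximum(features @ weights, 0.0)

rng = np.random.default_rng(0)

# Stand-ins for raw per-modality features (camera pixels, LiDAR points,
# radar returns), flattened to fixed-size vectors for this sketch.
camera = rng.normal(size=(1, 64))
lidar = rng.normal(size=(1, 32))
radar = rng.normal(size=(1, 16))

# Each modality gets its own encoder projecting into a shared 8-dim space.
w_cam = rng.normal(size=(64, 8))
w_lid = rng.normal(size=(32, 8))
w_rad = rng.normal(size=(16, 8))

# Joint fusion: concatenate the three embeddings and project them into a
# single world-model state, rather than picking one sensor at a time.
fused_in = np.concatenate(
    [encode(camera, w_cam), encode(lidar, w_lid), encode(radar, w_rad)],
    axis=1,
)
w_fuse = rng.normal(size=(24, 8))
world_state = fused_in @ w_fuse  # unified representation, shape (1, 8)
print(world_state.shape)
```

The point of the sketch is structural: no modality is discarded, so when cameras degrade in fog the radar and LiDAR embeddings still contribute to the fused state.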
What It Covers
Waymo co-CEO Dmitri Dolgov explains the technical architecture behind 500,000 weekly autonomous rides, covering the sensor fusion stack, the foundation model distillation pipeline, why driver-assist systems cannot incrementally evolve into full autonomy, and how Generation 6 hardware cuts costs to levels comparable to premium ADAS systems while enabling accelerated global deployment.
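The teacher-to-student distillation step described above can be illustrated with a toy example: a small "student" model trained to match a larger "teacher" model's softened output distribution. Everything here (linear models, temperature value, training loop) is a minimal sketch of generic knowledge distillation, not Waymo's pipeline.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))          # small batch of input features
w_teacher = rng.normal(size=(6, 3))  # large off-board model (stand-in)
w_student = np.zeros((6, 3))         # smaller onboard model, trained to match

lr, temperature = 0.5, 2.0
for _ in range(200):
    t = softmax(x @ w_teacher / temperature)  # soft teacher targets
    s = softmax(x @ w_student / temperature)
    # Gradient of cross-entropy between teacher and student distributions.
    grad = x.T @ (s - t) / len(x)
    w_student -= lr * grad

# After training, the student's predictions track the teacher's.
agree = (
    softmax(x @ w_student).argmax(1) == softmax(x @ w_teacher).argmax(1)
).mean()
```

The temperature softens the teacher's distribution so the student learns relative preferences across outputs, not just the top choice — the standard mechanism that lets a small, fast model inherit behavior from a large one.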
Key Questions Answered
- • Emergent AI Behavior as a Capability Signal: A concrete example of emergent model capability occurred when a Waymo vehicle detected a pedestrian obscured behind a bus using peripheral LiDAR returns bouncing beneath the vehicle chassis — a detection method no engineer explicitly programmed. This type of emergent behavior, enabled by intermediate world representations rather than pure pixel-to-trajectory end-to-end models, signals that the foundation model approach produces capabilities that exceed explicit engineering specifications.
Notable Moment
Dolgov describes watching a Waymo vehicle detect a pedestrian hidden entirely behind a bus and respond correctly — then discovering the system had used faint LiDAR reflections bouncing under the bus chassis to infer the person's presence and predict their movement. No engineer designed this behavior; the model derived it independently from training.