The Model Eats the Scaffolding: DeepMind's Logan Kilpatrick & Tulsee Doshi on 3.5 Flash, Omni & More
Episode: 59 min
Read time: 2 min
AI-Generated Summary
Key Takeaways
- ✓ Model-Harness Co-training: Google now trains Gemini models alongside its Antigravity agent harness, so the model is optimized for tool-calling loops, orchestration, and agentic workflows from the start. Teams building on top inherit this infrastructure without rebuilding it, cutting iteration cycles from months to days and enabling faster product launches across all Google surfaces.
- ✓ Flash-First Strategy: Google deliberately prioritizes cost-adjusted performance over raw capability maximization. Gemini 3.5 Flash benchmarks at roughly 280 tokens per second on Artificial Analysis, runs three times faster than comparable large models, and costs significantly less. For consumer-scale products like Search and the Gemini app, latency improvements outperform quality gains in live experiments when users are unwilling to wait.
- ✓ "Model Eats the Scaffolding" Cycle: Every 12–18 months, the surrounding scaffolding that developers build around AI models gets absorbed into the model itself. Product teams should avoid having every team independently rebuild agentic infrastructure from scratch. Standardizing on a shared harness layer reduces redundant engineering, accelerates deployment, and surfaces model failure modes faster through centralized feedback loops.
- ✓ Context Window Economics: Context windows have plateaued near one million tokens because serving costs become prohibitive at scale: a single one-million-token request can cost several dollars. Google's direction is shifting toward smart context compaction, which selectively retrieves relevant information rather than expanding raw window size. This keeps latency and cost manageable while effectively giving models access to much larger information pools.
- ✓ Recursive Self-Improvement (Practical, Not Singular): Google uses Gemini internally to improve Gemini, including running ablations, submitting code changes, and generating research reports autonomously. However, humans remain in the driver's seat on large pre-training runs because a misdirected run wastes expensive compute. The framing is deep human-AI collaboration rather than autonomous AI-led model development, with humans focused on strategic interpretation of results.
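The cost and compaction points above can be made concrete with a small sketch. Everything here is illustrative rather than a description of how Google implements compaction: the price constant is a made-up placeholder (the episode only says a one-million-token request "can cost several dollars"), the word-count tokenizer and keyword-overlap scoring are toy stand-ins, and the function names are hypothetical. Only the 280 tokens-per-second figure comes from the summary.

```python
# Hypothetical price, chosen only for illustration (not a published rate).
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 2.00
# Throughput figure quoted in the summary for Gemini 3.5 Flash.
QUOTED_OUTPUT_TOKENS_PER_SECOND = 280


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())


def input_cost_usd(num_tokens: int,
                   usd_per_million: float = ASSUMED_USD_PER_MILLION_INPUT_TOKENS) -> float:
    """Input cost of a request under the assumed per-million-token price."""
    return num_tokens / 1_000_000 * usd_per_million


def decode_seconds(output_tokens: int,
                   tokens_per_second: float = QUOTED_OUTPUT_TOKENS_PER_SECOND) -> float:
    """Time to stream a response at a fixed decode throughput."""
    return output_tokens / tokens_per_second


def relevance(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query words found in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def compact_context(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Keep the most relevant chunks that fit inside token_budget.

    Chunks with no overlap with the query are dropped entirely.
    """
    ranked = sorted(chunks, key=lambda c: relevance(query, c), reverse=True)
    kept: list[str] = []
    used = 0
    for chunk in ranked:
        if relevance(query, chunk) == 0:
            continue  # irrelevant to this query, never sent to the model
        size = estimate_tokens(chunk)
        if used + size <= token_budget:
            kept.append(chunk)
            used += size
    return kept


if __name__ == "__main__":
    docs = [
        "the harness handles tool calls",
        "lunch menu for friday",
        "tool calls are retried by the harness on failure",
        "parking rules",
    ]
    selected = compact_context("harness tool calls", docs, token_budget=10)
    print(selected)
    print(input_cost_usd(sum(estimate_tokens(c) for c in selected)))
```

A production system would presumably replace the keyword overlap with embedding-based retrieval or model-driven summarization. The point of the sketch is only that the request cost scales with the tokens actually sent, not with the size of the underlying information pool.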
What It Covers
Google DeepMind's Logan Kilpatrick and Tulsee Doshi join host Nathan Labenz at Google HQ ahead of Google I/O to discuss Gemini 3.5 Flash, the Omni video generation model, the agent harness infrastructure, recursive self-improvement, context window limits, and Google's AI product strategy across products that serve billions of users.
Notable Moment
A Google safety and alignment researcher ran a full suite of model ablations from her phone while sitting in a hot tub, receiving a complete report within an hour. The anecdote illustrates how AI-assisted research workflows have already compressed tasks that previously required days of engineering time into single sessions.
Source: Cognitive Revolution podcast.