
AI Summary
→ WHAT IT COVERS Arash Selvan and Mukund Sridhar, the PM and tech lead behind Gemini Deep Research, explain how they built the original deep research agent category. They cover the technical architecture, including custom fine-tuned models, asynchronous orchestration systems, iterative planning mechanisms, and evaluation strategies for five-minute autonomous research tasks that browse dozens of websites. → KEY INSIGHTS - **Editable Research Plans:** Deep Research generates an upfront research plan showing exactly how it will break down the query before starting. Users can edit this plan conversationally or via button, though most hit start immediately. This transparency mechanism helps users understand the approach even when they don't engage, addressing the challenge of spending five minutes on potentially misaligned research directions. - **Custom Post-Training Required:** The team built a specialized fine-tuned version of Gemini 1.5 Pro specifically for deep research, not just the base model. This post-training work teaches iterative planning across domains without overfitting per vertical. The challenge involves balancing new capabilities while preserving pre-training knowledge, using data augmentation techniques to maintain generalizability across their research ontology. - **Asynchronous Orchestration Platform:** Google built a new async engine enabling users to close computers and receive notifications when research completes. The system maintains state, handles retries on failures, and manages hundreds of LLM calls reliably. This differs from previous synchronous chat interactions and resembles workflow systems like Temporal or Apache Airflow but optimized for multi-minute agent jobs. - **Context Over RAG Strategy:** Deep Research keeps all browsed websites in the full context window (up to two million tokens) rather than using retrieval augmented generation. RAG struggles when queries have multiple attributes since cosine similarity doesn't work well. The team only falls back to RAG when context exceeds limits or for conversations beyond 10 turns ago, prioritizing recent research for complex follow-up questions. - **Ontology-Based Evaluation:** Instead of vertical-specific benchmarks, the team developed a research behavior ontology spanning broad-shallow queries (like finding summer camps) to narrow-deep investigations. They combine automated metrics (plan length, iteration steps, time distribution) with human evaluation on comprehensiveness and groundedness. Standard benchmarks don't translate to product experience since text output entropy makes verification challenging. - **Counterintuitive Latency Preferences:** Users actually value longer research times, contrary to all Google product orthodoxy where latency improvements always increased satisfaction and retention. The team initially worried about five-minute waits and built a hard 10-minute limit, but users appreciate visible work being done across 30-70 websites. This inverts traditional product metrics where faster always performed better. → NOTABLE MOMENT The team discovered users suspected they were artificially inflating wait times when investor Jason Calacanis asked if they generated answers in 10 seconds then made users wait. This completely contradicted their assumptions since every Google product historically showed latency improvements drove all other metrics up, leading them to initially build both five-minute and 15-minute hardcore versions. 
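
→ ILLUSTRATIVE SKETCHES

The plan-then-edit flow from the first insight can be pictured as a small generate/revise loop. This is a minimal sketch, not the Gemini implementation: `ResearchPlan`, `generate_plan`, and `revise_plan` are hypothetical names, and the canned decomposition stands in for a model call.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchPlan:
    query: str
    steps: list[str] = field(default_factory=list)

def generate_plan(query: str) -> ResearchPlan:
    """Stand-in for the model call that decomposes a query into sub-tasks."""
    # In the real product this decomposition comes from the fine-tuned model;
    # here it is faked for illustration.
    return ResearchPlan(query, steps=[
        f"Identify the key entities in: {query}",
        "Browse authoritative sources for each entity",
        "Cross-check conflicting claims",
        "Synthesize findings into a report",
    ])

def revise_plan(plan: ResearchPlan, user_edit: str) -> ResearchPlan:
    """Conversational edit: in practice another model call; here a naive append."""
    plan.steps.append(f"(user-added) {user_edit}")
    return plan

plan = generate_plan("Compare summer camps near Seattle")
plan = revise_plan(plan, "Only include camps with financial aid")
# Most users skip this step and hit start immediately.
```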
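For the post-training insight, one common way to balance new capabilities against knowledge retention is to augment the new data across the ontology and mix in replay of pre-training-style data. The episode doesn't specify Google's recipe; the sketch below is a generic illustration with invented axes, verticals, and a 30% replay ratio.

```python
import itertools
import random

# Illustrative ontology axes and verticals; these names are assumptions.
BREADTH = ["broad", "narrow"]
DEPTH = ["shallow", "deep"]
VERTICALS = ["travel", "consumer products", "academic literature", "local services"]

def augment(example: dict) -> list[dict]:
    """Spread one planning example across ontology cells and verticals so the
    model learns iterative planning as a skill, not a per-vertical template."""
    cells = [f"{b}-{d}" for b, d in itertools.product(BREADTH, DEPTH)]
    return [{**example, "cell": c, "vertical": v} for c in cells for v in VERTICALS]

def build_mixture(agentic_data: list[dict], replay_data: list[dict],
                  replay_ratio: float = 0.3) -> list[dict]:
    """Blend new planning data with pre-training-style replay so the fine-tune
    adds capabilities without eroding general knowledge."""
    augmented = [ex for raw in agentic_data for ex in augment(raw)]
    n_replay = min(int(len(augmented) * replay_ratio), len(replay_data))
    mixture = augmented + random.sample(replay_data, n_replay)
    random.shuffle(mixture)
    return mixture
```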
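The orchestration platform is described as Temporal/Airflow-like: durable state, retries, resumability, and a completion notification. Here is a minimal asyncio sketch of those mechanics; the file-based checkpoint, simulated failures, and backoff schedule are illustrative assumptions, not Google's engine.

```python
import asyncio
import json
import random
from pathlib import Path

STATE = Path("research_job_state.json")  # durable checkpoint, à la Temporal/Airflow

async def llm_call(step: str) -> str:
    """Stand-in for one of the hundreds of model/browse calls in a job."""
    await asyncio.sleep(0.1)
    if random.random() < 0.2:  # simulate a transient failure
        raise RuntimeError(f"transient error on {step}")
    return f"result of {step}"

async def run_step(step: str, retries: int = 3) -> str:
    for attempt in range(1, retries + 1):
        try:
            return await llm_call(step)
        except RuntimeError:
            if attempt == retries:
                raise
            await asyncio.sleep(0.5 * attempt)  # backoff before retrying

def notify_user(message: str) -> None:
    print(message)  # in production: a push notification

async def run_job(steps: list[str]) -> None:
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    for step in steps:
        if step in state:  # resume: skip work already checkpointed
            continue
        state[step] = await run_step(step)
        STATE.write_text(json.dumps(state))  # checkpoint after every step
    notify_user("Your research report is ready.")  # user may have closed their laptop

asyncio.run(run_job(["browse site A", "browse site B", "synthesize"]))
```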
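The context-over-RAG strategy reduces to a routing decision per request. A minimal sketch, assuming a crude characters-per-token estimate and a hypothetical `build_context` helper; the two-million-token and 10-turn thresholds come from the episode.

```python
MAX_CONTEXT_TOKENS = 2_000_000  # Gemini 1.5 Pro long-context limit
RECENT_TURN_WINDOW = 10         # older turns are handed to retrieval

def count_tokens(text: str) -> int:
    return len(text) // 4  # rough chars-per-token estimate, an assumption

def build_context(pages: list[str], turns: list[str]) -> tuple[str, bool]:
    """Return (context, used_rag_fallback): prefer stuffing every browsed
    page into the window; fall back to retrieval only on overflow."""
    recent_turns = turns[-RECENT_TURN_WINDOW:]
    candidate = "\n\n".join(pages + recent_turns)
    if count_tokens(candidate) <= MAX_CONTEXT_TOKENS:
        return candidate, False  # everything fits: no lossy retrieval needed
    # Overflow: pages would be indexed and retrieved via RAG here (not shown).
    return "\n\n".join(recent_turns), True
```

Keeping everything in context sidesteps the multi-attribute weakness of cosine-similarity retrieval described in the episode, at the cost of longer prompts.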
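The automated half of the ontology-based evaluation amounts to aggregating metrics per ontology cell. The sketch below assumes a toy run-log format; the metric names mirror the ones mentioned (plan length, iteration steps, time distribution), while human raters separately score comprehensiveness and groundedness (not shown).

```python
from collections import defaultdict
from statistics import mean

def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate automated metrics for each ontology cell (e.g. broad-shallow)."""
    by_cell: dict[str, list[dict]] = defaultdict(list)
    for run in runs:
        by_cell[run["cell"]].append(run)
    return {
        cell: {
            "avg_plan_length": mean(r["plan_length"] for r in rs),
            "avg_iterations": mean(r["iterations"] for r in rs),
            "avg_browse_seconds": mean(r["browse_seconds"] for r in rs),
        }
        for cell, rs in by_cell.items()
    }

# Toy run logs with invented numbers, for illustration only.
runs = [
    {"cell": "broad-shallow", "plan_length": 4, "iterations": 2, "browse_seconds": 90},
    {"cell": "narrow-deep",   "plan_length": 7, "iterations": 5, "browse_seconds": 260},
]
print(summarize_runs(runs))
```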
💼 SPONSORS: None detected
🏷️ Agent Engineering, LLM Post-Training, Research Automation, Workflow Orchestration, Product Evaluation, AI Transparency