The TWIML AI Podcast

Proactive Agents for the Web with Devi Parikh - #756

56 min episode · 2 min read

AI-Generated Summary

Key Takeaways

  • Visual-based browser navigation: Training models on website screenshots rather than the DOM proves more reliable and generalizable across sites; DOM-based approaches kept hitting edge cases, such as date pickers, that demanded site-specific fixes (see the first sketch after this list).
  • Scouts architecture combines APIs and browser automation: The system queries 80-90 MCP servers for structured data but spins up remote browsers, driven by custom-trained navigator models, for information behind forms, optimizing first for coverage and then for precision in user-facing reports (second sketch below).
  • Post-training progression maximizes model capability: Yutori trains QwQ models through supervised fine-tuning, then rejection sampling, then reinforcement learning to achieve reliable browser automation while keeping costs below those of third-party API providers for their production workloads (third sketch below).
  • Background agents require hierarchical tool management: With 80-90 tools in play, reliability breaks down if every tool is exposed to the agent at once. Sub-agents that each see only a relevant subset enable scalable multi-agent workflows that adapt to real-time web information (fourth sketch below).
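
To make the first takeaway concrete, here is a minimal sketch of a screenshot-driven navigation loop using Playwright. The propose_action call is a hypothetical stand-in for a custom-trained navigator model, not Yutori's actual system; the point is that the policy only ever sees pixels, never the DOM.

```python
# Sketch of screenshot-based navigation: the model maps (screenshot, goal)
# to a single UI action, so it generalizes across pages that look alike
# even when their markup differs.
from dataclasses import dataclass
from playwright.sync_api import sync_playwright


@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    x: int = 0       # pixel coordinates predicted from the screenshot
    y: int = 0
    text: str = ""


def propose_action(screenshot: bytes, goal: str) -> Action:
    """Hypothetical stand-in for a custom-trained navigator model."""
    raise NotImplementedError


def navigate(url: str, goal: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for _ in range(max_steps):
            # The policy sees only pixels -- no DOM inspection anywhere.
            action = propose_action(page.screenshot(), goal)
            if action.kind == "done":
                break
            if action.kind == "click":
                page.mouse.click(action.x, action.y)
            elif action.kind == "type":
                page.keyboard.type(action.text)
```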
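
The second takeaway, coverage-first dispatch, might look like the following. McpClient and BrowserSession are illustrative stand-ins, not Yutori's API: structured MCP queries are tried first, and a remote browser is spun up only when the information sits behind a form.

```python
from typing import Optional


class McpClient:
    """Stand-in for a client over one of the ~80-90 MCP servers."""

    def query(self, request: str) -> Optional[str]:
        return None  # structured answer, or None if this server can't help


class BrowserSession:
    """Stand-in for a remote browser driven by the navigator model."""

    def run(self, request: str) -> str:
        raise NotImplementedError


def fetch(request: str, mcp_servers: list[McpClient]) -> str:
    # Coverage first: cheap structured APIs before any browser spin-up.
    for server in mcp_servers:
        result = server.query(request)
        if result is not None:
            return result
    # Only information behind forms needs a live, model-driven browser.
    return BrowserSession().run(request)
```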
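
For the post-training progression, the middle stage can be sketched as below: sample several rollouts per task, keep only the ones a verifier accepts, and reuse the survivors as the next round of fine-tuning data before the RL stage. Both model.sample and verify are hypothetical stand-ins.

```python
def verify(task: str, rollout: str) -> bool:
    """Hypothetical checker, e.g. did the browser task actually succeed?"""
    raise NotImplementedError


def rejection_sample(model, tasks: list[str], n_samples: int = 8):
    """Keep only verified rollouts as supervised data for the next stage."""
    kept = []
    for task in tasks:
        for _ in range(n_samples):
            rollout = model.sample(task)  # hypothetical generation call
            if verify(task, rollout):
                kept.append((task, rollout))
    return kept  # feeds another round of SFT before reinforcement learning
```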
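
Finally, the hierarchical tool-management point: rather than exposing all 80-90 tools to one agent, a router hands each task to a sub-agent that sees only a small, relevant subset. All names below are illustrative, not a description of Yutori's implementation.

```python
from typing import Callable

Tool = Callable[[str], str]


class SubAgent:
    def __init__(self, name: str, tools: dict[str, Tool]):
        self.name = name
        self.tools = tools  # only this subset ever enters the prompt

    def run(self, task: str) -> str:
        # In a real system an LLM would choose among self.tools; the point
        # is that the choice space stays small and scoped.
        raise NotImplementedError


def pick_sub_agent(task: str, sub_agents: dict[str, SubAgent]) -> SubAgent:
    """Hypothetical top-level routing call (itself typically an LLM)."""
    raise NotImplementedError


def route(task: str, sub_agents: dict[str, SubAgent]) -> str:
    # The orchestrator picks a sub-agent, not an individual tool, so each
    # decision is over a handful of options instead of 80-90.
    return pick_sub_agent(task, sub_agents).run(task)
```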

What It Covers

Devi Parikh, co-founder of Yutori, explains how AI browser agents will replace manual web interactions through proactive monitoring and automation, starting with Scouts, Yutori's product that monitors websites for changes the user specifies.

Notable Moment

Parikh reveals that, contrary to initial assumptions, consuming web pages visually the way humans do, rather than parsing their underlying code, proved essential for building reliable browser agents: identical-looking pages often have completely different underlying structures.
