The TWIML AI Podcast

Proactive Agents for the Web with Devi Parikh - #756

56 min episode · 2 min read

AI-Generated Summary

Key Takeaways

  • Visual-based browser navigation: Training models on website screenshots rather than the DOM proves more reliable and generalizable across sites; DOM-based approaches kept hitting edge cases, such as date pickers, that demanded site-specific fixes (see the first sketch after this list).
  • Scouts architecture combines APIs and browser automation: The system queries 80-90 MCP servers for structured data but spins up remote browsers, driven by custom-trained navigator models, for information behind forms, optimizing first for coverage and then for precision in user-facing reports (second sketch below).
  • Post-training progression maximizes model capability: Yutori trains QwQ models through supervised fine-tuning, then rejection sampling, then reinforcement learning to achieve reliable browser automation while keeping costs below those of third-party API providers for their production workloads (third sketch below).
  • Background agents require hierarchical tool management: With 80-90 tools in play, reliability breaks down if every tool is exposed to the agent at once. Sub-agents that each see only a relevant subset enable scalable multi-agent workflows that adapt to real-time web information (fourth sketch below).
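
To make the first takeaway concrete, here is a minimal sketch of a screenshot-driven navigation loop using Playwright. The propose_action call is a hypothetical stand-in for a custom-trained navigator model, not Yutori's actual system; the point is that the policy only ever sees pixels, never the DOM.

```python
# Sketch of screenshot-based navigation: the model maps (screenshot, goal)
# to a single UI action, so it generalizes across pages that look alike
# even when their markup differs.
from dataclasses import dataclass
from playwright.sync_api import sync_playwright


@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    x: int = 0       # pixel coordinates predicted from the screenshot
    y: int = 0
    text: str = ""


def propose_action(screenshot: bytes, goal: str) -> Action:
    """Hypothetical stand-in for a custom-trained navigator model."""
    raise NotImplementedError


def navigate(url: str, goal: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for _ in range(max_steps):
            # The policy sees only pixels -- no DOM inspection anywhere.
            action = propose_action(page.screenshot(), goal)
            if action.kind == "done":
                break
            if action.kind == "click":
                page.mouse.click(action.x, action.y)
            elif action.kind == "type":
                page.keyboard.type(action.text)
```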
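
The second takeaway, coverage-first dispatch, might look like the following. McpClient and BrowserSession are illustrative stand-ins, not Yutori's API: structured MCP queries are tried first, and a remote browser is spun up only when the information sits behind a form.

```python
from typing import Optional


class McpClient:
    """Stand-in for a client over one of the ~80-90 MCP servers."""

    def query(self, request: str) -> Optional[str]:
        return None  # structured answer, or None if this server can't help


class BrowserSession:
    """Stand-in for a remote browser driven by the navigator model."""

    def run(self, request: str) -> str:
        raise NotImplementedError


def fetch(request: str, mcp_servers: list[McpClient]) -> str:
    # Coverage first: cheap structured APIs before any browser spin-up.
    for server in mcp_servers:
        result = server.query(request)
        if result is not None:
            return result
    # Only information behind forms needs a live, model-driven browser.
    return BrowserSession().run(request)
```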
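
For the post-training progression, the middle stage can be sketched as below: sample several rollouts per task, keep only the ones a verifier accepts, and reuse the survivors as the next round of fine-tuning data before the RL stage. Both model.sample and verify are hypothetical stand-ins.

```python
def verify(task: str, rollout: str) -> bool:
    """Hypothetical checker, e.g. did the browser task actually succeed?"""
    raise NotImplementedError


def rejection_sample(model, tasks: list[str], n_samples: int = 8):
    """Keep only verified rollouts as supervised data for the next stage."""
    kept = []
    for task in tasks:
        for _ in range(n_samples):
            rollout = model.sample(task)  # hypothetical generation call
            if verify(task, rollout):
                kept.append((task, rollout))
    return kept  # feeds another round of SFT before reinforcement learning
```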
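
Finally, the hierarchical tool-management point: rather than exposing all 80-90 tools to one agent, a router hands each task to a sub-agent that sees only a small, relevant subset. All names below are illustrative, not a description of Yutori's implementation.

```python
from typing import Callable

Tool = Callable[[str], str]


class SubAgent:
    def __init__(self, name: str, tools: dict[str, Tool]):
        self.name = name
        self.tools = tools  # only this subset ever enters the prompt

    def run(self, task: str) -> str:
        # In a real system an LLM would choose among self.tools; the point
        # is that the choice space stays small and scoped.
        raise NotImplementedError


def pick_sub_agent(task: str, sub_agents: dict[str, SubAgent]) -> SubAgent:
    """Hypothetical top-level routing call (itself typically an LLM)."""
    raise NotImplementedError


def route(task: str, sub_agents: dict[str, SubAgent]) -> str:
    # The orchestrator picks a sub-agent, not an individual tool, so each
    # decision is over a handful of options instead of 80-90.
    return pick_sub_agent(task, sub_agents).run(task)
```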

What It Covers

Devi Parikh, co-founder of Yutori, explains how AI browser agents will replace manual web interactions through proactive monitoring and automation, starting with Scouts, Yutori's product that monitors websites for changes the user specifies.

Notable Moment

Parikh reveals that, contrary to initial assumptions, consuming web pages visually the way humans do, rather than parsing their underlying code, proved essential for building reliable browser agents: identical-looking pages often have completely different underlying structures.
