Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion
Episode: 77 min
Read time: 3 min
Topics: Software Development
AI-Generated Summary
Key Takeaways
- ✓Agent Rebuild Cadence: Notion has rebuilt its agent harness five times since late 2022, with each iteration driven by a specific failure: custom XML tool formats the model didn't know, few-shot prompts that required 5-6 gatekeepers to edit one shared string, and context windows too short for multi-turn reliability. The unlock came with Claude Sonnet 3.6/3.7 in late 2024 and early 2025, when reasoning quality finally matched production requirements.
- ✓Progressive Tool Disclosure: Scaling beyond a certain tool count degrades agent quality — any engineer adding a niche tool would inadvertently cause the model to over-call it. Notion solved this by implementing progressive disclosure in their harness, now supporting 100+ tools without quality regression. The practical rule: never expose all tools simultaneously; build a search or filter layer so the model only sees contextually relevant tools per turn.
- ✓Distributing Tool Ownership: Moving from few-shot prompts to goal-driven tool definitions was the single largest velocity multiplier at Notion. Previously, 5-6 engineers controlled one shared prompt file where ordering and selection caused quality conflicts. Now each product team owns their tool definition and its eval, enabling parallel development. The tradeoff: duplicate tool names across teams can cause hard failures, requiring governance on tool naming conventions.
- ✓Three-Tier Eval Architecture: Notion runs evals at three distinct levels — CI regression tests with stochastic pass-rate thresholds, launch-blocking report cards requiring 80-90% pass rates across defined user journeys, and frontier headroom evals deliberately targeting 30% pass rates. The third tier, built in partnership with Anthropic and OpenAI over the past 2-3 months, prevents eval saturation and provides directional signal on where model capabilities are heading.
- ✓MCP vs. CLI Decision Framework: Use CLIs when agents need self-debugging capability within the same runtime environment — a broken MCP transport leaves the agent stranded with no recovery path. Use MCPs for narrow, tightly-permissioned agents where a full compute runtime is unnecessary and security boundaries matter. For high-frequency deterministic tasks, prefer direct API calls over MCP to avoid repeated token costs outside the cache window, which compounds into significant pricing inefficiency.
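The progressive-disclosure and tool-ownership takeaways above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Notion's harness: `Tool`, `ToolRegistry`, `register`, `visible_tools`, the `owner` field, and the keyword-overlap scoring are all hypothetical names and choices invented for the example. The summary says only that a search or filter layer exists and that duplicate tool names cause hard failures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    owner: str  # hypothetical: the product team that owns the tool and its eval


@dataclass
class ToolRegistry:
    _tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        # Reject duplicate names at registration time rather than at call time.
        if tool.name in self._tools:
            existing = self._tools[tool.name]
            raise ValueError(
                f"tool name {tool.name!r} already registered by {existing.owner}"
            )
        self._tools[tool.name] = tool

    def visible_tools(self, query: str, limit: int = 3) -> list:
        # Naive keyword overlap stands in for a real search or filter layer.
        words = set(query.lower().split())
        scored = []
        for tool in self._tools.values():
            overlap = len(words & set(tool.description.lower().split()))
            if overlap > 0:
                scored.append((overlap, tool.name))
        # Highest overlap first, with the name as a deterministic tie-breaker.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self._tools[name] for _, name in scored[:limit]]


registry = ToolRegistry()
registry.register(Tool("search_pages", "search workspace pages by keyword", "search-team"))
registry.register(Tool("create_database", "create a structured database", "db-team"))
registry.register(Tool("send_mail", "send an email message", "mail-team"))

# Only tools relevant to this turn are exposed to the model.
shown = registry.visible_tools("create a database for applicants")
print([t.name for t in shown])
```

A production filter would likely use embeddings or a model-driven search tool rather than word overlap. The point of the sketch is the shape: teams register tools independently, name collisions fail loudly, and the model sees a small per-turn subset.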
What It Covers
Simon Last and Sarah Sachs from Notion walk through five rebuilds of the company's AI agent system since 2022, tracing the technical evolution from custom XML tool-calling to 100+ tools exposed through progressive disclosure. They also cover MCP versus CLI tradeoffs, the software factory vision, model behavior engineering as a distinct career path, and usage-based credit pricing for enterprise agentic workflows.
Key Questions Answered
- •Model Behavior Engineer Role: Notion built a dedicated career path called Model Behavior Engineer (MBE), which began with people manually labeling outputs in Google Sheets. The role now combines data science, prompt engineering, test design, and qualitative judgment, with no software engineering background required. MBEs own the frontier headroom evals, triage agent failures nightly via a custom agent, and work with a dedicated data scientist and eval engineer. Notion is actively hiring for this function.
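The three eval tiers described in this summary can be pictured as different pass-rate bands applied to repeated runs of the same stochastic check. The sketch below is an assumption-heavy illustration, not Notion's tooling: `Tier`, `pass_rate`, `verdict`, the CI floor of 0.85, and the frontier ceiling of 0.50 are hypothetical. Only the 80-90% launch range and the roughly 30% frontier target come from the summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tier:
    name: str
    min_pass_rate: float  # below this, the tier reports a failure
    max_pass_rate: float  # above this, the eval is treated as saturated


# Illustrative thresholds. The 0.80 launch floor echoes the summary;
# the CI floor and the frontier ceiling are assumptions for this sketch.
CI = Tier("ci_regression", 0.85, 1.00)
LAUNCH = Tier("launch_report_card", 0.80, 1.00)
FRONTIER = Tier("frontier_headroom", 0.00, 0.50)


def pass_rate(case: Callable[[], bool], trials: int) -> float:
    # Agent output is stochastic, so each case is run several times.
    passes = sum(1 for _ in range(trials) if case())
    return passes / trials


def verdict(tier: Tier, rate: float) -> str:
    if rate < tier.min_pass_rate:
        return "fail"
    if rate > tier.max_pass_rate:
        return "saturated"
    return "ok"


def make_case(outcomes):
    # Stand-in for a real agent run: replays a fixed list of pass/fail results.
    it = iter(outcomes)
    return lambda: next(it)


rate = pass_rate(make_case([True] * 9 + [False]), trials=10)
print(verdict(CI, rate), verdict(LAUNCH, rate), verdict(FRONTIER, rate))
```

A case that clears the CI and launch tiers is flagged as saturated at the frontier tier, which is the signal that a harder eval is needed to keep measuring headroom.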
Notable Moment
During the live demo, a custom agent built in roughly 15 minutes automatically enriched incoming coworking space applications by running web searches on each applicant and populating a structured database — with no human involvement after setup. The agent then flagged that it needed Gmail or Notion Mail connected to proceed, illustrating current permission boundary design.