Latent Space

Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion

77 min episode · 3 min read

Topics: Software Development

AI-Generated Summary

Key Takeaways

  • Agent Rebuild Cadence: Notion rebuilt their agent harness five times since late 2022, with each iteration driven by a specific failure: custom XML tool formats the model didn't know, few-shot prompts that required 5-6 gatekeepers to edit one shared string, and context windows too short for multi-turn reliability. The unlock came with Claude Sonnet 3.6/3.7 in late 2024 and early 2025, when reasoning quality finally matched production requirements.
  • Progressive Tool Disclosure: Scaling beyond a certain tool count degrades agent quality — any engineer adding a niche tool would inadvertently cause the model to over-call it. Notion solved this by implementing progressive disclosure in their harness, now supporting 100+ tools without quality regression. The practical rule: never expose all tools simultaneously; build a search or filter layer so the model only sees contextually relevant tools per turn.
  • Distributing Tool Ownership: Moving from few-shot prompts to goal-driven tool definitions was the single largest velocity multiplier at Notion. Previously, 5-6 engineers controlled one shared prompt file where ordering and selection caused quality conflicts. Now each product team owns their tool definition and its eval, enabling parallel development. The tradeoff: duplicate tool names across teams can cause hard failures, requiring governance on tool naming conventions.
  • Three-Tier Eval Architecture: Notion runs evals at three distinct levels — CI regression tests with stochastic pass-rate thresholds, launch-blocking report cards requiring 80-90% pass rates across defined user journeys, and frontier headroom evals deliberately targeting 30% pass rates. The third tier, built in partnership with Anthropic and OpenAI over the past 2-3 months, prevents eval saturation and provides directional signal on where model capabilities are heading.
  • MCP vs. CLI Decision Framework: Use CLIs when agents need self-debugging capability within the same runtime environment — a broken MCP transport leaves the agent stranded with no recovery path. Use MCPs for narrow, tightly permissioned agents where a full compute runtime is unnecessary and security boundaries matter. For high-frequency deterministic tasks, prefer direct API calls over MCP to avoid repeated token costs outside the cache window, which compounds into significant pricing inefficiency.
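The progressive-disclosure rule above — never expose the full registry, compute the visible tool list per turn — can be sketched as a relevance filter in front of the model call. This is a minimal illustration, assuming a naive keyword-overlap scorer; the `Tool` type and `select_tools` function are invented for the example and are not Notion's actual harness API:

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str


def select_tools(query: str, registry: list[Tool], k: int = 5) -> list[Tool]:
    """Expose only the k tools most relevant to the current turn,
    instead of sending the full 100+ tool registry to the model."""
    words = set(query.lower().split())

    def score(tool: Tool) -> int:
        # Naive relevance: count of query words appearing in the description.
        return len(words & set(tool.description.lower().split()))

    return sorted(registry, key=score, reverse=True)[:k]


registry = [
    Tool("create_page", "create a new page document in the workspace"),
    Tool("search_db", "search a database for matching records"),
    Tool("send_mail", "send an email message to a recipient"),
]

visible = select_tools("search the tasks database for overdue records", registry, k=1)
print([t.name for t in visible])  # → ['search_db']
```

In production the scorer would likely be an embedding search or intent classifier rather than word overlap; the structural point is that the tool list the model sees is recomputed every turn, so adding a niche tool to the registry no longer risks the model over-calling it in unrelated contexts.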

What It Covers

Simon Last and Sarah Sachs from Notion detail five rebuilds of their AI agent system since 2022, covering the technical evolution from custom XML tool-calling to 100+ progressive disclosure tools, their MCP versus CLI tradeoffs, software factory vision, model behavior engineering as a distinct career path, and usage-based credit pricing for enterprise agentic workflows.

Key Questions Answered

  • Model Behavior Engineer Role: Notion built a dedicated career path called Model Behavior Engineer, which grew out of people manually labeling agent outputs in Google Sheets. The role now combines data science, prompt engineering, test design, and qualitative judgment — no software engineering background required. MBEs own frontier headroom evals, triage agent failures nightly via a custom agent, and work with a dedicated data scientist and eval engineer. Notion is actively hiring for this function.
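The first eval tier described above — CI regression tests gated on stochastic pass rates — can be sketched as reruns of each case against a nondeterministic agent, with a launch gate on the aggregate rate. Everything here is illustrative: the agent is simulated with a seeded RNG, and the 0.85 threshold is simply a value inside the 80-90% launch-bar range mentioned in the takeaways:

```python
import random


def run_case(case_id: int, trials: int = 10, seed: int = 0) -> float:
    """Stand-in for an agent eval case: reruns the (nondeterministic)
    agent several times and returns the observed pass rate. Here the
    agent is simulated by a seeded RNG with a 90% per-trial pass chance."""
    rng = random.Random(seed + case_id)
    passes = sum(rng.random() < 0.9 for _ in range(trials))
    return passes / trials


def gate(case_ids, threshold: float = 0.85) -> tuple[bool, float]:
    """Launch gate: block if the aggregate pass rate across all eval
    cases falls below the threshold."""
    rates = [run_case(c) for c in case_ids]
    aggregate = sum(rates) / len(rates)
    return aggregate >= threshold, aggregate


ok, rate = gate(range(20))
print(f"aggregate pass rate {rate:.2f} -> {'ship' if ok else 'block'}")
```

Rerunning each case and thresholding on a rate, rather than demanding a single deterministic pass, is what makes the gate usable for stochastic agent behavior; the frontier headroom tier would use the same machinery but deliberately pick cases hard enough that the aggregate sits near 30%.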

Notable Moment

During the live demo, a custom agent built in roughly 15 minutes automatically enriched incoming coworking space applications by running web searches on each applicant and populating a structured database — with no human involvement after setup. The agent then flagged that it needed Gmail or Notion Mail connected to proceed, illustrating current permission boundary design.
