Latent Space

Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion

77 min episode · 3 min read

Topics: Software Development

AI-Generated Summary

Key Takeaways

  • Agent Rebuild Cadence: Notion rebuilt their agent harness five times since late 2022, with each iteration driven by a specific failure: custom XML tool formats the model didn't know, few-shot prompts that required 5-6 gatekeepers to edit one shared string, and context windows too short for multi-turn reliability. The unlock came with Claude Sonnet 3.6/3.7 in late 2024 and early 2025, when reasoning quality finally matched production requirements.
  • Progressive Tool Disclosure: Scaling beyond a certain tool count degrades agent quality — any engineer adding a niche tool would inadvertently cause the model to over-call it. Notion solved this by implementing progressive disclosure in their harness, now supporting 100+ tools without quality regression. The practical rule: never expose all tools simultaneously; build a search or filter layer so the model only sees contextually relevant tools per turn.
  • Distributing Tool Ownership: Moving from few-shot prompts to goal-driven tool definitions was the single largest velocity multiplier at Notion. Previously, 5-6 engineers controlled one shared prompt file where ordering and selection caused quality conflicts. Now each product team owns their tool definition and its eval, enabling parallel development. The tradeoff: duplicate tool names across teams can cause hard failures, requiring governance on tool naming conventions.
  • Three-Tier Eval Architecture: Notion runs evals at three distinct levels — CI regression tests with stochastic pass-rate thresholds, launch-blocking report cards requiring 80-90% pass rates across defined user journeys, and frontier headroom evals deliberately targeting 30% pass rates. The third tier, built in partnership with Anthropic and OpenAI over the past 2-3 months, prevents eval saturation and provides directional signal on where model capabilities are heading.
  • MCP vs. CLI Decision Framework: Use CLIs when agents need self-debugging capability within the same runtime environment — a broken MCP transport leaves the agent stranded with no recovery path. Use MCPs for narrow, tightly permissioned agents where a full compute runtime is unnecessary and security boundaries matter. For high-frequency deterministic tasks, prefer direct API calls over MCP to avoid repeated token costs outside the cache window, which compounds into significant pricing inefficiency.
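The progressive-disclosure rule above — never expose the full registry, compute the visible tool list per turn — can be sketched as a relevance filter in front of the model call. This is a minimal illustration, assuming a naive keyword-overlap scorer; the `Tool` type and `select_tools` function are invented for the example and are not Notion's actual harness API:

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str


def select_tools(query: str, registry: list[Tool], k: int = 5) -> list[Tool]:
    """Expose only the k tools most relevant to the current turn,
    instead of sending the full 100+ tool registry to the model."""
    words = set(query.lower().split())

    def score(tool: Tool) -> int:
        # Naive relevance: count of query words appearing in the description.
        return len(words & set(tool.description.lower().split()))

    return sorted(registry, key=score, reverse=True)[:k]


registry = [
    Tool("create_page", "create a new page document in the workspace"),
    Tool("search_db", "search a database for matching records"),
    Tool("send_mail", "send an email message to a recipient"),
]

visible = select_tools("search the tasks database for overdue records", registry, k=1)
print([t.name for t in visible])  # → ['search_db']
```

In production the scorer would likely be an embedding search or intent classifier rather than word overlap; the structural point is that the tool list the model sees is recomputed every turn, so adding a niche tool to the registry no longer risks the model over-calling it in unrelated contexts.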

What It Covers

Simon Last and Sarah Sachs from Notion detail five rebuilds of their AI agent system since 2022, covering the technical evolution from custom XML tool-calling to 100+ progressive disclosure tools, their MCP versus CLI tradeoffs, software factory vision, model behavior engineering as a distinct career path, and usage-based credit pricing for enterprise agentic workflows.

Key Questions Answered

  • Model Behavior Engineer Role: Notion built a dedicated career path called Model Behavior Engineer, which grew out of people manually labeling agent outputs in Google Sheets. The role now combines data science, prompt engineering, test design, and qualitative judgment — no software engineering background required. MBEs own frontier headroom evals, triage agent failures nightly via a custom agent, and work with a dedicated data scientist and eval engineer. Notion is actively hiring for this function.
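The first eval tier described above — CI regression tests gated on stochastic pass rates — can be sketched as reruns of each case against a nondeterministic agent, with a launch gate on the aggregate rate. Everything here is illustrative: the agent is simulated with a seeded RNG, and the 0.85 threshold is simply a value inside the 80-90% launch-bar range mentioned in the takeaways:

```python
import random


def run_case(case_id: int, trials: int = 10, seed: int = 0) -> float:
    """Stand-in for an agent eval case: reruns the (nondeterministic)
    agent several times and returns the observed pass rate. Here the
    agent is simulated by a seeded RNG with a 90% per-trial pass chance."""
    rng = random.Random(seed + case_id)
    passes = sum(rng.random() < 0.9 for _ in range(trials))
    return passes / trials


def gate(case_ids, threshold: float = 0.85) -> tuple[bool, float]:
    """Launch gate: block if the aggregate pass rate across all eval
    cases falls below the threshold."""
    rates = [run_case(c) for c in case_ids]
    aggregate = sum(rates) / len(rates)
    return aggregate >= threshold, aggregate


ok, rate = gate(range(20))
print(f"aggregate pass rate {rate:.2f} -> {'ship' if ok else 'block'}")
```

Rerunning each case and thresholding on a rate, rather than demanding a single deterministic pass, is what makes the gate usable for stochastic agent behavior; the frontier headroom tier would use the same machinery but deliberately pick cases hard enough that the aggregate sits near 30%.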

Notable Moment

During the live demo, a custom agent built in roughly 15 minutes automatically enriched incoming coworking space applications by running web searches on each applicant and populating a structured database — with no human involvement after setup. The agent then flagged that it needed Gmail or Notion Mail connected to proceed, illustrating current permission boundary design.
