Simon Last

Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion

Apr 15, 2026 · 77 min · Notion

AI Summary

→ WHAT IT COVERS
Simon Last and Sarah Sachs from Notion detail five rebuilds of their AI agent system since 2022, covering the technical evolution from custom XML tool-calling to 100+ progressive disclosure tools, their MCP versus CLI tradeoffs, software factory vision, model behavior engineering as a distinct career path, and usage-based credit pricing for enterprise agentic workflows.

→ KEY INSIGHTS
- **Agent Rebuild Cadence:** Notion rebuilt their agent harness five times since late 2022, with each iteration driven by a specific failure: custom XML tool formats the model didn't know, few-shot prompts that required 5-6 gatekeepers to edit one shared string, and context windows too short for multi-turn reliability. The unlock came with Claude Sonnet 3.6/3.7 in early 2024, when reasoning quality finally matched production requirements.
- **Progressive Tool Disclosure:** Scaling beyond a certain tool count degrades agent quality: any engineer adding a niche tool would inadvertently cause the model to over-call it. Notion solved this by implementing progressive disclosure in their harness, now supporting 100+ tools without quality regression. The practical rule: never expose all tools simultaneously; build a search or filter layer so the model only sees contextually relevant tools per turn.
- **Distributing Tool Ownership:** Moving from few-shot prompts to goal-driven tool definitions was the single largest velocity multiplier at Notion. Previously, 5-6 engineers controlled one shared prompt file where ordering and selection caused quality conflicts. Now each product team owns their tool definition and its eval, enabling parallel development. The tradeoff: duplicate tool names across teams can cause hard failures, requiring governance on tool naming conventions.
- **Three-Tier Eval Architecture:** Notion runs evals at three distinct levels: CI regression tests with stochastic pass-rate thresholds, launch-blocking report cards requiring 80-90% pass rates across defined user journeys, and frontier headroom evals deliberately targeting 30% pass rates. The third tier, built in partnership with Anthropic and OpenAI over the past 2-3 months, prevents eval saturation and provides directional signal on where model capabilities are heading.
- **MCP vs. CLI Decision Framework:** Use CLIs when agents need self-debugging capability within the same runtime environment; a broken MCP transport leaves the agent stranded with no recovery path. Use MCPs for narrow, tightly-permissioned agents where a full compute runtime is unnecessary and security boundaries matter. For high-frequency deterministic tasks, prefer direct API calls over MCP to avoid repeated token costs outside the cache window, which compounds into significant pricing inefficiency.
- **Model Behavior Engineer Role:** Notion built a dedicated career path called Model Behavior Engineer, starting from people manually labeling Google Sheets outputs. The role now combines data science, prompt engineering, test design, and qualitative judgment, with no software engineering background required. MBEs own frontier headroom evals, triage agent failures nightly via a custom agent, and work with a dedicated data scientist and eval engineer. Notion is actively hiring for this function.

→ NOTABLE MOMENT
During the live demo, a custom agent built in roughly 15 minutes automatically enriched incoming coworking space applications by running web searches on each applicant and populating a structured database, with no human involvement after setup. The agent then flagged that it needed Gmail or Notion Mail connected to proceed, illustrating current permission boundary design.

💼 SPONSORS
None detected

🏷️ AI Agents, Model Evaluation, MCP Protocol, Enterprise AI, Notion Product, Software Factory
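
The progressive disclosure rule and the tool-naming tradeoff from the key insights above can be illustrated with a small sketch. This is not Notion's harness: `Tool`, `ToolRegistry`, `relevant`, the team names, and the keyword-overlap scoring are all hypothetical, chosen only to show the shape of a filter layer that exposes a few relevant tools per turn and rejects duplicate tool names at registration time. A production system would likely use a stronger retrieval method than word overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    owner: str  # hypothetical: the product team that owns the tool and its eval


class ToolRegistry:
    """Hypothetical registry: unique tool names, per-turn filtering."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        # Fail at registration time so a duplicate name from another team
        # never reaches the model as an ambiguous tool.
        if tool.name in self._tools:
            existing = self._tools[tool.name]
            raise ValueError(
                f"tool name {tool.name!r} is already registered by {existing.owner}"
            )
        self._tools[tool.name] = tool

    def relevant(self, query: str, limit: int = 3) -> list[Tool]:
        # Assumption: naive keyword overlap stands in for a real search layer.
        words = set(query.lower().split())
        scored = []
        for tool in self._tools.values():
            text = f"{tool.name.replace('_', ' ')} {tool.description}".lower()
            score = len(words & set(text.split()))
            if score > 0:
                scored.append((score, tool.name, tool))
        # Highest overlap first; ties broken by name for stable output.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [tool for _, _, tool in scored[:limit]]


# Illustrative tools only; the names and owners are made up.
registry = ToolRegistry()
registry.register(Tool("web_search", "search the web for public pages", "search-team"))
registry.register(Tool("create_database_row", "add a row to a database", "databases-team"))
registry.register(Tool("send_mail", "send an email message", "mail-team"))

# Only tools matching the current turn are shown to the model.
visible = registry.relevant("search the web for this applicant")
```

In this sketch the model would only ever receive `visible` for a given turn rather than the full registry, and a second team trying to register another `web_search` would get a `ValueError` instead of silently shadowing the first.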


Featured On 1 Podcast

Latent Space
