Optimizing Agent Behavior in Production with Gideon Mendels
Episode: 52 min
Read time: 2 min
Topics: Productivity, Investing, Startups
AI-Generated Summary
Key Takeaways
- ✓ Evaluation Dataset Bootstrap Strategy: Build evaluation datasets by capturing production failures and user complaints rather than creating synthetic data upfront. When users report incorrect agent responses, document the input, expected output, and context as test cases. Starting with just 20 real-world samples provides enough of a foundation to run optimization algorithms and prevent regressions, making evals practical rather than theoretical.
- ✓ Prompt Optimization as Search Problem: Treat system prompts, tool descriptions, and configurations as hyperparameters in a search space. Algorithms like GEPA use LLMs to analyze failed test cases, suggest new prompt candidates, and iteratively improve performance. LangChain's JSON schema prompt improved from 12% to 96% accuracy in two iterations for under one dollar in API costs, demonstrating rapid, cost-effective optimization.
- ✓ Configuration Management Over Version Control: Store prompts and agent configurations in a centralized registry rather than embedding them in code repositories. Applications fetch the current configuration at runtime, which lets product managers update prompts without redeployment and supports A/B testing across traffic percentages and canary deployments. This separates agent behavior updates from application deployment cycles.
- ✓ End-to-End Testing Priority: Start evaluation efforts with system-level tests that validate complete agent workflows before building unit tests for individual components. The easiest high-value evaluation checks whether the agent calls the correct tool in a given context, which treats tool selection as a classification problem. This provides broad coverage without requiring detailed graph traversal validation.
- ✓ Framework Selection Reality: Approximately 80% of successful production agents use custom-built implementations rather than established agent frameworks. Teams achieve better results by using frontier models initially, building small evaluation datasets first, and optimizing for functionality before cost. Token costs decrease roughly 90% year-over-year, making premature cost optimization counterproductive while baseline agent performance is still being established.
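The tool-selection check described in the takeaways above can be sketched in a few lines. This is a minimal illustration, not code from the episode: `EvalCase`, `tool_selection_accuracy`, and the keyword-based `toy_agent` are hypothetical names, and the four cases stand in for inputs captured from real user complaints.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One test case captured from a production failure or user complaint."""
    user_input: str
    context: str
    expected_tool: str


def tool_selection_accuracy(
    choose_tool: Callable[[str, str], str],
    cases: list[EvalCase],
) -> tuple[float, list[EvalCase]]:
    """Treat tool selection as classification: return (accuracy, failed cases)."""
    failures = [
        case for case in cases
        if choose_tool(case.user_input, case.context) != case.expected_tool
    ]
    accuracy = 1.0 - len(failures) / len(cases)
    return accuracy, failures


def toy_agent(user_input: str, context: str) -> str:
    """Hypothetical stand-in for the real agent: a simple keyword router."""
    text = user_input.lower()
    if "refund" in text:
        return "issue_refund"
    if "order" in text:
        return "lookup_order"
    return "search_docs"


# Illustrative cases only; a real dataset would come from production reports.
CASES = [
    EvalCase("Where is my order?", "logged-in customer", "lookup_order"),
    EvalCase("I want a refund for my order", "logged-in customer", "issue_refund"),
    EvalCase("How do I reset my password?", "anonymous visitor", "search_docs"),
    EvalCase("Cancel my order", "logged-in customer", "cancel_order"),
]

accuracy, failures = tool_selection_accuracy(toy_agent, CASES)
print(f"accuracy={accuracy:.2f}, failed={[c.user_input for c in failures]}")
```

The failed cases are returned alongside the score because they are the useful part: each one is either a regression to fix or a signal for the next prompt change.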
What It Covers
Gideon Mendels, CEO of Comet, explains how LLM-powered agents require new evaluation and optimization approaches that blend software engineering and ML practices. He covers building evaluation datasets from production failures, using LLMs to automatically optimize prompts through search algorithms, and creating continuous improvement loops for agents in production environments.
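To make the idea of prompt optimization as search concrete, here is a deliberately simplified greedy loop. It is a sketch under stated assumptions, not GEPA or any specific library's algorithm: `fake_model` and `fake_proposer` are hypothetical stand-ins for the LLM that runs the prompt and the LLM that reads failures and proposes new candidates.

```python
from typing import Callable

Case = tuple[str, str]  # (input, expected output)


def score(
    prompt: str,
    run: Callable[[str, str], str],
    cases: list[Case],
) -> tuple[float, list[Case]]:
    """Return the fraction of cases the prompt passes, plus the failing cases."""
    failed = [(x, want) for x, want in cases if run(prompt, x) != want]
    return 1 - len(failed) / len(cases), failed


def optimize(
    seed: str,
    run: Callable[[str, str], str],
    propose: Callable[[str, list[Case]], list[str]],
    cases: list[Case],
    iterations: int = 2,
) -> tuple[str, float]:
    """Greedy search: keep the best-scoring prompt seen so far."""
    best = seed
    best_score, failed = score(seed, run, cases)
    for _ in range(iterations):
        if not failed:
            break  # every case passes, nothing left to fix
        for candidate in propose(best, failed):
            cand_score, cand_failed = score(candidate, run, cases)
            if cand_score > best_score:
                best, best_score, failed = candidate, cand_score, cand_failed
    return best, best_score


def fake_model(prompt: str, user_input: str) -> str:
    """Hypothetical model: only follows instructions that appear in the prompt."""
    out = user_input
    if "uppercase" in prompt:
        out = out.upper()
    if "strip" in prompt:
        out = out.strip()
    return out


def fake_proposer(prompt: str, failed: list[Case]) -> list[str]:
    """Hypothetical proposer: a real one would ask an LLM to analyze `failed`."""
    return [prompt + " Always uppercase.", prompt + " Always strip whitespace."]


CASES = [(" hi ", "HI"), ("ok", "OK"), (" a", "A")]
best_prompt, best_score = optimize("Echo the input.", fake_model, fake_proposer, CASES)
print(best_prompt, best_score)
```

The evaluation dataset plays the role of the objective function here, which is why even a small set of real failures is enough to start searching.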
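The configuration-management point (prompts fetched at runtime from a registry, with traffic split across variants) can be illustrated with an in-memory sketch. `PromptRegistry` and its methods are hypothetical names for illustration; a real registry would be a separate service, and the hashing scheme is one common way to keep a user in the same bucket, not something described in the episode.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """In-memory stand-in for a centralized prompt/config registry."""
    variants: dict[str, str] = field(default_factory=dict)
    # Ordered (variant name, traffic percent) pairs that sum to 100.
    rollout: list[tuple[str, int]] = field(default_factory=list)

    def publish(self, name: str, prompt: str) -> None:
        """Store or update a prompt without touching application code."""
        self.variants[name] = prompt

    def set_rollout(self, rollout: list[tuple[str, int]]) -> None:
        """Set the traffic split, e.g. 90% stable and 10% canary."""
        if sum(pct for _, pct in rollout) != 100:
            raise ValueError("traffic percentages must sum to 100")
        self.rollout = rollout

    def fetch(self, user_id: str) -> tuple[str, str]:
        """Return (variant name, prompt) for this user at request time."""
        # A stable hash keeps each user in the same bucket across requests.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        threshold = 0
        for name, pct in self.rollout:
            threshold += pct
            if bucket < threshold:
                return name, self.variants[name]
        raise RuntimeError("rollout not configured")


registry = PromptRegistry()
registry.publish("stable", "You are a helpful support agent.")
registry.publish("canary", "You are a concise support agent.")
registry.set_rollout([("stable", 90), ("canary", 10)])

variant, prompt = registry.fetch("user-123")
print(variant, prompt)
```

Because the application calls `fetch` on each request, promoting the canary is a registry update (`set_rollout([("canary", 100)])`) rather than a new deployment.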
Notable Moment
Mendels reveals that most teams building agents skip evaluation datasets entirely and rely on manually testing a few inputs before deploying to production. This vibe-checking approach explains why fewer production agents exist than expected: without systematic validation, teams cannot confidently ship updates to nondeterministic systems.