Optimizing Agent Behavior in Production with Gideon Mendels
Episode · 52 min
Read time · 2 min
Topics: Product & Tech Trends, Psychology & Behavior
AI-Generated Summary
Key Takeaways
- ✓ Evaluation Dataset Bootstrap Strategy: Build evaluation datasets by capturing production failures and user complaints rather than authoring synthetic data upfront. When a user reports an incorrect agent response, record the input, the expected output, and the surrounding context as a test case (see the dataset sketch after this list). Starting with just 20 real-world samples is enough to run optimization algorithms and guard against regressions, making evals practical rather than theoretical.
- ✓ Prompt Optimization as Search Problem: Treat system prompts, tool descriptions, and configurations as hyperparameters in a search space. Optimizers such as GEPA use an LLM to analyze failed test cases, propose new prompt candidates, and iteratively improve performance (a minimal search loop is sketched after this list). LangChain's JSON schema prompt improved from 12% to 96% accuracy in two iterations for under one dollar in API costs, demonstrating rapid, cost-effective optimization.
- ✓ Configuration Management Over Version Control: Store prompts and agent configurations in a centralized registry rather than embedding them in code repositories. Applications fetch the current configuration at runtime, which lets product managers update prompts without a redeployment, supports A/B testing across traffic percentages, and enables canary deployments (see the runtime-fetch sketch below). This decouples agent behavior updates from application deployment cycles.
- ✓ End-to-End Testing Priority: Start evaluation efforts with system-level tests that validate complete agent workflows before building unit tests for individual components. The easiest high-value evaluation checks whether the agent calls the correct tool given a specific context, essentially treating tool selection as a classification problem (a test example follows this list). This provides broad coverage without requiring detailed graph traversal validation.
- ✓ Framework Selection Reality: Approximately 80% of successful production agents use custom-built implementations rather than established agent frameworks. Teams achieve better results by using frontier models initially, building small evaluation datasets first, and optimizing for functionality before cost. Token costs decrease roughly 90% year-over-year, making premature cost optimization counterproductive while establishing baseline agent performance.
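As a concrete illustration of the bootstrap strategy, a failure-driven eval set can start as a plain JSONL file of recorded cases. The schema and field names below (`input`, `expected_output`, `context`) are assumptions for this sketch; the episode does not prescribe a specific format.

```python
import json

# Hypothetical schema for a bootstrapped eval set: each record captures one
# production failure or user complaint as a reusable test case.
failure_cases = [
    {
        "input": "Cancel my subscription but keep my account data",
        "expected_output": "Calls cancel_subscription with retain_data=True",
        "context": "Annual-plan user; reported the agent offered a discount instead of cancelling",
    },
    {
        "input": "What's my usage so far this month?",
        "expected_output": "Calls get_usage_report for the current billing period",
        "context": "User complaint: agent answered from stale cached numbers",
    },
]

# Persist as JSONL so new failures can be appended as they are reported;
# roughly 20 such cases are enough to start running optimizers and catching regressions.
with open("eval_dataset.jsonl", "w") as f:
    for case in failure_cases:
        f.write(json.dumps(case) + "\n")
```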
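The prompt-as-hyperparameter idea can be sketched as a simple search loop over that dataset. The `run_eval` and `propose_prompt` helpers below are hypothetical stand-ins for an eval harness and an LLM call; real optimizers such as GEPA are considerably more sophisticated, but the control flow is the same.

```python
# Minimal sketch of prompt optimization as search: score a candidate prompt
# against the eval set, ask an LLM to rewrite it based on the failures, and
# keep whichever candidate scores best.

def run_eval(prompt: str, dataset: list[dict]) -> tuple[float, list[dict]]:
    """Run the agent with `prompt` over the dataset; return (accuracy, failed cases)."""
    raise NotImplementedError  # wire up to your agent and scoring logic

def propose_prompt(current_prompt: str, failures: list[dict]) -> str:
    """Ask an LLM to rewrite the prompt given the cases it failed on."""
    raise NotImplementedError  # wire up to an LLM call

def optimize(prompt: str, dataset: list[dict], iterations: int = 2) -> str:
    best_prompt = prompt
    best_score, failures = run_eval(best_prompt, dataset)
    for _ in range(iterations):
        if not failures:
            break  # nothing left to fix
        candidate = propose_prompt(best_prompt, failures)
        score, candidate_failures = run_eval(candidate, dataset)
        if score > best_score:
            best_prompt, best_score, failures = candidate, score, candidate_failures
    return best_prompt
```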
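The runtime-fetch pattern for centralized configuration might look roughly like the following. The registry URL, endpoint, and response fields are assumptions for illustration, not a specific product's API; a hosted prompt registry or an internal service would fill this role in practice.

```python
import hashlib
import json
import urllib.request

REGISTRY_URL = "https://prompts.internal.example.com"  # hypothetical registry service

def fetch_prompt_config(name: str) -> dict:
    """Fetch the current prompt configuration at runtime instead of baking it
    into the deployed artifact. Endpoint and response shape are assumptions."""
    with urllib.request.urlopen(f"{REGISTRY_URL}/prompts/{name}") as resp:
        return json.load(resp)

def pick_variant(user_id: str, variants: list[dict]) -> dict:
    """Deterministic A/B split: hash the user id into a bucket in [0, 100) and
    pick the variant whose cumulative traffic percentage covers that bucket."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for variant in variants:
        cumulative += variant["traffic_percent"]
        if bucket < cumulative:
            return variant
    return variants[-1]  # fallback if percentages don't sum to 100

# Usage (assuming the registry is reachable and returns {"variants": [...]}):
# config = fetch_prompt_config("support-agent-system-prompt")
# variant = pick_variant("user-123", config["variants"])
# system_prompt = variant["text"]
```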
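An end-to-end check on tool selection can be written as an ordinary test. The `run_agent` entry point and the tool names below are hypothetical; the point is that the assertion treats the chosen tool as a label to classify.

```python
import pytest

def run_agent(user_message: str, context: dict) -> list[str]:
    """Hypothetical entry point: run the full agent end to end and return the
    names of the tools it called, in order."""
    raise NotImplementedError  # replace with your agent's invocation

# Tool selection as classification: for each (input, context) pair, the agent
# is expected to reach for a specific tool.
@pytest.mark.parametrize(
    "message, context, expected_tool",
    [
        ("Cancel my subscription", {"plan": "annual"}, "cancel_subscription"),
        ("How much have I spent this month?", {"plan": "annual"}, "get_usage_report"),
    ],
)
def test_agent_selects_correct_tool(message, context, expected_tool):
    tools_called = run_agent(message, context)
    assert expected_tool in tools_called
```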
What It Covers
Gideon Mendels, CEO of Comet, explains how LLM-powered agents require new evaluation and optimization approaches that blend software engineering and ML practices. He covers building evaluation datasets from production failures, using LLMs to automatically optimize prompts through search algorithms, and creating continuous improvement loops for agents in production environments.
Notable Moment
Mendels reveals that most teams building agents skip evaluation datasets entirely, relying on manual spot-checks of a few inputs before deploying to production. This vibe-checking approach helps explain why fewer production agents exist than expected: teams lack the systematic validation needed to confidently ship updates to nondeterministic systems.
More from Software Engineering Daily
- Hype and Reality of the AI Coding Shift (Apr 23 · 59 min)
- Unlocking the Data Layer for Agentic AI with Simba Khadder (Apr 21 · 49 min)
- Agentic Mesh with Eric Broda
- New Relic and Agentic DevOps with Nic Benders
- Mobile App Security with Ryan Lloyd
Similar Episodes
Related episodes from other podcasts:
- Masters of Scale (Apr 25): Possible: Netflix co-founder Reed Hastings: stories, schools, superpowers
- The Futur (Apr 25): Why Process is Better Than AI w/ Scott Clum | Ep 430
- 20VC (20 Minute VC) (Apr 25): 20Product: Replit CEO on Why Coding Models Are Plateauing | Why the SaaS Apocalypse is Justified: Will Incumbents Be Replaced? | Why IDEs Are Dead and Do PMs Survive the Next 3-5 Years with Amjad Masad
- This Week in Startups (Apr 25): The Defense Tech Startup YC Kicked Out of a Meeting is Now Arming America | E2280
- Marketplace (Apr 24): When does AI become a spending suck?