Software Engineering Daily

Optimizing Agent Behavior in Production with Gideon Mendels

52 min episode · 2 min read

Topics

Product & Tech Trends, Psychology & Behavior

AI-Generated Summary

Key Takeaways

  • Evaluation Dataset Bootstrap Strategy: Build evaluation datasets by capturing production failures and user complaints rather than creating synthetic data upfront. When users report incorrect agent responses, document the input, expected output, and context as test cases (see the first sketch after this list). Starting with just 20 real-world samples provides enough foundation to run optimization algorithms and prevent regressions, making evals practical rather than theoretical.
  • Prompt Optimization as Search Problem: Treat system prompts, tool descriptions, and configurations as hyperparameters in a search space. Algorithms like GEPA use LLMs to analyze failed test cases, suggest new prompt candidates, and iteratively improve performance (see the optimization-loop sketch below). LangChain's JSON schema prompt improved from 12% to 96% accuracy in two iterations for under one dollar in API costs, demonstrating rapid, cost-effective optimization.
  • Configuration Management Over Version Control: Store prompts and agent configurations in a centralized registry rather than embedding them in code repositories. Applications fetch the current configuration at runtime, which lets product managers update prompts without a redeploy, supports A/B testing across traffic percentages, and enables canary deployments (see the registry-fetch sketch below). This decouples agent behavior updates from application deployment cycles.
  • End-to-End Testing Priority: Start evaluation efforts with system-level tests that validate complete agent workflows before building unit tests for individual components. The easiest high-value evaluation checks whether an agent calls the correct tools given specific contexts, essentially treating tool selection as a classification problem (see the tool-selection sketch below). This provides broad coverage without requiring detailed graph-traversal validation.
  • Framework Selection Reality: Approximately 80% of successful production agents use custom-built implementations rather than established agent frameworks. Teams achieve better results by using frontier models initially, building small evaluation datasets first, and optimizing for functionality before cost. Token costs decrease roughly 90% year-over-year, making premature optimization counterproductive when establishing baseline agent performance.
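
A minimal sketch of the bootstrap strategy in Python, assuming a dict-per-case format and a substring check for scoring; the field names and scoring rule are illustrative, not any specific product's API:

```python
# Hypothetical sketch: turning reported production failures into eval cases.
eval_cases = [
    {
        "input": "Cancel my subscription but keep my data",
        "context": "user has an active annual plan",
        "expected": "call cancel_subscription with retain_data=True",
    },
    # Grow this list with each user-reported failure; ~20 cases is enough
    # to start running optimizers and catching regressions.
]

def run_regression_suite(agent_fn):
    """Score an agent (a callable taking input and context) against captured cases."""
    passed = 0
    for case in eval_cases:
        output = agent_fn(case["input"], case["context"])
        if case["expected"] in output:  # simplest possible check
            passed += 1
    return passed / len(eval_cases)
```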
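A simplified sketch of the optimization loop. GEPA itself maintains a pool of candidates with Pareto selection; this greedy version keeps only the best prompt found so far, and `llm` stands in for any text-completion call:

```python
# GEPA-style loop, heavily simplified: score, reflect on failures, retry.
def passes(prompt, case, llm):
    return case["expected"] in llm(prompt + "\n" + case["input"])

def score(prompt, cases, llm):
    return sum(passes(prompt, c, llm) for c in cases) / len(cases)

def optimize_prompt(system_prompt, eval_cases, llm, iterations=2):
    best, best_score = system_prompt, score(system_prompt, eval_cases, llm)
    for _ in range(iterations):
        failures = [c for c in eval_cases if not passes(best, c, llm)]
        # Ask an LLM to analyze the failures and propose a better candidate.
        candidate = llm(
            "Rewrite this system prompt so the failing cases below pass.\n"
            f"Prompt:\n{best}\n\nFailing cases:\n{failures}"
        )
        candidate_score = score(candidate, eval_cases, llm)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```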
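A sketch of runtime configuration fetching with a percentage-based traffic split; the registry URL and payload shape are assumptions:

```python
# Hypothetical prompt-registry fetch with a canary traffic split.
import hashlib
import json
import urllib.request

REGISTRY_URL = "https://registry.internal.example/prompts/support-agent"

def fetch_prompt(user_id: str) -> str:
    with urllib.request.urlopen(REGISTRY_URL) as resp:
        # Assumed payload: {"stable": ..., "canary": ..., "canary_percent": 10}
        config = json.load(resp)
    # Hash the user id so each user consistently lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < config.get("canary_percent", 0):
        return config["canary"]
    return config["stable"]
```

The hash-based bucketing is what makes the split reproducible: the same user always sees the same variant, which keeps A/B comparisons clean.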
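A sketch of the tool-selection check, scoring the agent's first tool call as a predicted class label; `first_tool_call` is a hypothetical hook that a real harness would read from a trace:

```python
# Tool selection as classification: did the agent pick the right tool?
tool_cases = [
    {"input": "What's the weather in Oslo?", "expected_tool": "get_weather"},
    {"input": "Book a table for two at 7pm", "expected_tool": "make_reservation"},
]

def tool_selection_accuracy(agent, cases=tool_cases):
    correct = sum(
        agent.first_tool_call(c["input"]) == c["expected_tool"] for c in cases
    )
    return correct / len(cases)
```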

What It Covers

Gideon Mendels, CEO of Comet, explains how LLM-powered agents require new evaluation and optimization approaches that blend software engineering and ML practices. He covers building evaluation datasets from production failures, using LLMs to automatically optimize prompts through search algorithms, and creating continuous improvement loops for agents in production environments.

Notable Moment

Mendels reveals that most teams building agents skip evaluation datasets entirely, relying on manual testing of a few inputs before production deployment. This vibe-checking approach helps explain why fewer production agents exist than expected: teams lack the systematic validation needed to confidently ship updates to nondeterministic systems.
