Software Engineering Daily

Optimizing Agent Behavior in Production with Gideon Mendels

52 min episode · 2 min read

Topics

Product & Tech Trends, Psychology & Behavior

AI-Generated Summary

Key Takeaways

  • Evaluation Dataset Bootstrap Strategy: Build evaluation datasets by capturing production failures and user complaints rather than creating synthetic data upfront. When users report incorrect agent responses, document the input, expected output, and context as test cases (see the first sketch after this list). Starting with just 20 real-world samples provides enough foundation to run optimization algorithms and prevent regressions, making evals practical rather than theoretical.
  • Prompt Optimization as Search Problem: Treat system prompts, tool descriptions, and configurations as hyperparameters in a search space. Algorithms like GEPA use LLMs to analyze failed test cases, suggest new prompt candidates, and iteratively improve performance (see the optimization-loop sketch below). LangChain's JSON schema prompt improved from 12% to 96% accuracy in two iterations for under one dollar in API costs, demonstrating rapid, cost-effective optimization.
  • Configuration Management Over Version Control: Store prompts and agent configurations in a centralized registry rather than embedding them in code repositories. Applications fetch the current configuration at runtime, which lets product managers update prompts without a redeploy, supports A/B testing across traffic percentages, and enables canary deployments (see the registry-fetch sketch below). This decouples agent behavior updates from application deployment cycles.
  • End-to-End Testing Priority: Start evaluation efforts with system-level tests that validate complete agent workflows before building unit tests for individual components. The easiest high-value evaluation checks whether an agent calls the correct tools given specific contexts, essentially treating tool selection as a classification problem (see the tool-selection sketch below). This provides broad coverage without requiring detailed graph-traversal validation.
  • Framework Selection Reality: Approximately 80% of successful production agents use custom-built implementations rather than established agent frameworks. Teams achieve better results by using frontier models initially, building small evaluation datasets first, and optimizing for functionality before cost. Token costs decrease roughly 90% year-over-year, making premature optimization counterproductive when establishing baseline agent performance.
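
A minimal sketch of the bootstrap strategy in Python, assuming a dict-per-case format and a substring check for scoring; the field names and scoring rule are illustrative, not any specific product's API:

```python
# Hypothetical sketch: turning reported production failures into eval cases.
eval_cases = [
    {
        "input": "Cancel my subscription but keep my data",
        "context": "user has an active annual plan",
        "expected": "call cancel_subscription with retain_data=True",
    },
    # Grow this list with each user-reported failure; ~20 cases is enough
    # to start running optimizers and catching regressions.
]

def run_regression_suite(agent_fn):
    """Score an agent (a callable taking input and context) against captured cases."""
    passed = 0
    for case in eval_cases:
        output = agent_fn(case["input"], case["context"])
        if case["expected"] in output:  # simplest possible check
            passed += 1
    return passed / len(eval_cases)
```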
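A simplified sketch of the optimization loop. GEPA itself maintains a pool of candidates with Pareto selection; this greedy version keeps only the best prompt found so far, and `llm` stands in for any text-completion call:

```python
# GEPA-style loop, heavily simplified: score, reflect on failures, retry.
def passes(prompt, case, llm):
    return case["expected"] in llm(prompt + "\n" + case["input"])

def score(prompt, cases, llm):
    return sum(passes(prompt, c, llm) for c in cases) / len(cases)

def optimize_prompt(system_prompt, eval_cases, llm, iterations=2):
    best, best_score = system_prompt, score(system_prompt, eval_cases, llm)
    for _ in range(iterations):
        failures = [c for c in eval_cases if not passes(best, c, llm)]
        # Ask an LLM to analyze the failures and propose a better candidate.
        candidate = llm(
            "Rewrite this system prompt so the failing cases below pass.\n"
            f"Prompt:\n{best}\n\nFailing cases:\n{failures}"
        )
        candidate_score = score(candidate, eval_cases, llm)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```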
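A sketch of runtime configuration fetching with a percentage-based traffic split; the registry URL and payload shape are assumptions:

```python
# Hypothetical prompt-registry fetch with a canary traffic split.
import hashlib
import json
import urllib.request

REGISTRY_URL = "https://registry.internal.example/prompts/support-agent"

def fetch_prompt(user_id: str) -> str:
    with urllib.request.urlopen(REGISTRY_URL) as resp:
        # Assumed payload: {"stable": ..., "canary": ..., "canary_percent": 10}
        config = json.load(resp)
    # Hash the user id so each user consistently lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < config.get("canary_percent", 0):
        return config["canary"]
    return config["stable"]
```

The hash-based bucketing is what makes the split reproducible: the same user always sees the same variant, which keeps A/B comparisons clean.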
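A sketch of the tool-selection check, scoring the agent's first tool call as a predicted class label; `first_tool_call` is a hypothetical hook that a real harness would read from a trace:

```python
# Tool selection as classification: did the agent pick the right tool?
tool_cases = [
    {"input": "What's the weather in Oslo?", "expected_tool": "get_weather"},
    {"input": "Book a table for two at 7pm", "expected_tool": "make_reservation"},
]

def tool_selection_accuracy(agent, cases=tool_cases):
    correct = sum(
        agent.first_tool_call(c["input"]) == c["expected_tool"] for c in cases
    )
    return correct / len(cases)
```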

What It Covers

Gideon Mendels, CEO of Comet, explains how LLM-powered agents require new evaluation and optimization approaches that blend software engineering and ML practices. He covers building evaluation datasets from production failures, using LLMs to automatically optimize prompts through search algorithms, and creating continuous improvement loops for agents in production environments.

Notable Moment

Mendels reveals that most teams building agents skip evaluation datasets entirely, relying on manual testing of a few inputs before production deployment. This vibe-checking approach helps explain why fewer production agents exist than expected: teams lack the systematic validation needed to confidently ship updates to nondeterministic systems.
