Lenny's Podcast

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

106 min episode · 2 min read

Topics: Artificial Intelligence, Product & Tech Trends

AI-Generated Summary

Key Takeaways

  • Error Analysis Process: Start by manually reviewing 100 production traces, writing freeform notes on the first upstream error encountered in each trace. This open coding reveals actual failure modes rather than hypothetical ones, taking roughly one week initially, then 30 minutes weekly for ongoing monitoring and improvement (see the pipeline sketch after this list).
  • Benevolent Dictator Approach: Assign one domain expert with product taste to conduct error analysis rather than forming committees. For most applications this person should be the product manager, which keeps the process tractable and avoids the expensive consensus-building that stops teams from executing systematic evaluation workflows.
  • Axial Coding with AI: After collecting open codes, use LLMs to categorize the notes into actionable failure modes through axial coding, then build pivot tables to count occurrences and surface the most prevalent issues (the same sketch below covers this step). This turns chaos into prioritized problems, though AI cannot replace the initial human error analysis.
  • Binary LLM Judges: Build automated evaluators that output pass/fail for specific failure modes, not Likert scales, and validate each judge against human labels using a confusion matrix before deployment (see the second sketch below). These judges monitor production traces continuously, catching issues like inappropriate handoffs or hallucinated features that code-based tests cannot detect.
  • Strategic Eval Investment: Write 4-7 LLM judge evaluators for persistent, ambiguous failure modes that resist prompt fixes, and skip evals for obvious engineering errors. The highest-ROI activity is looking at actual user interaction data, which most teams neglect by jumping straight to hypothetical test cases without grounding them in real problems.
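
The error-analysis and axial-coding steps above lend themselves to a short pipeline sketch. The one below is a minimal Python illustration, not code from the episode: the trace store, the sample notes, the failure-mode names, and the llm_label placeholder for an LLM categorization call are all hypothetical, and in practice the open codes come from a human reading real production traces.

```python
import random

import pandas as pd

# Hypothetical trace store; in practice these come from your logging
# or observability tool.
traces = [{"id": i, "transcript": "..."} for i in range(2_000)]

# Step 1, error analysis: sample ~100 traces and open-code them by hand,
# one freeform note per trace describing the FIRST upstream error.
sample = random.sample(traces, k=100)  # the traces a human actually reads
open_codes = [
    {"trace_id": 7, "note": "said goodbye instead of offering alternatives"},
    {"trace_id": 19, "note": "claimed a gym the property does not have"},
    {"trace_id": 31, "note": "ended chat without capturing the lead's contact info"},
]  # ...one row per reviewed trace, written during manual review

# Step 2, axial coding: have an LLM map each freeform note to a single
# failure-mode label. llm_label is a placeholder for your model call.
def llm_label(note: str) -> str:
    # e.g. a chat-completion call whose prompt lists your candidate categories
    return "missed_handoff" if "goodbye" in note or "lead" in note else "hallucinated_feature"

# Step 3, the pivot: count labels to rank failure modes by prevalence.
df = pd.DataFrame(open_codes)
df["failure_mode"] = df["note"].map(llm_label)
print(df["failure_mode"].value_counts())
```

The pivot step is deliberately simple: once every note carries one failure-mode label, a plain count is enough to decide what to fix first.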

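The judge-validation step can be sketched the same way. Again a minimal illustration with made-up labels, assuming only pandas; substitute your own held-out traces and judge outputs.

```python
import pandas as pd

# Made-up validation set: human pass/fail labels on held-out traces,
# plus the LLM judge's verdict on the same traces (True = pass).
labels = pd.DataFrame({
    "human": [True, True, False, False, True, False, True, False, True, True],
    "judge": [True, False, False, False, True, True, True, False, True, True],
})

# Confusion matrix: rows are human ground truth, columns the judge.
print(pd.crosstab(labels["human"], labels["judge"],
                  rownames=["human"], colnames=["judge"]))

# Per-class agreement tells you whether the judge is reliable enough
# to monitor production traces continuously.
tpr = (labels.human & labels.judge).sum() / labels.human.sum()
tnr = (~labels.human & ~labels.judge).sum() / (~labels.human).sum()
print(f"catches real passes: {tpr:.0%}, catches real failures: {tnr:.0%}")
```

If either rate is low, the judge prompt gets refined and re-validated before the judge is trusted to run over production traffic.
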
What It Covers

Hamel Husain and Shreya Shankar, creators of the top-rated Maven evals course, demonstrate systematic methods for building AI product evaluations, using error analysis, open coding, and LLM judges to measure and improve application quality beyond vibes-based testing.

Notable Moment

The demonstration showed a real estate AI assistant telling a prospective tenant that no apartments with studies were available, then saying thanks and goodbye, completely missing the lead-nurturing opportunity. This failure only became visible through systematic trace review, not through vibes or generic benchmarks.
