Lenny's Podcast

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

106 min episode · 2 min read

Topics: Artificial Intelligence, Product & Tech Trends

AI-Generated Summary

Key Takeaways

  • Error Analysis Process: Start by manually reviewing 100 production traces, writing freeform notes on the first upstream error encountered in each trace. This open coding reveals actual failure modes rather than hypothetical ones, taking roughly one week initially, then 30 minutes weekly for ongoing monitoring and improvement (see the pipeline sketch after this list).
  • Benevolent Dictator Approach: Assign one domain expert with product taste to conduct error analysis rather than forming committees. For most applications this person should be the product manager, which keeps the process tractable and avoids the expensive consensus-building that stops teams from executing systematic evaluation workflows.
  • Axial Coding with AI: After collecting open codes, use LLMs to categorize the notes into actionable failure modes through axial coding, then build pivot tables to count occurrences and surface the most prevalent issues (the same sketch below covers this step). This turns chaos into prioritized problems, though AI cannot replace the initial human error analysis.
  • Binary LLM Judges: Build automated evaluators that output pass/fail for specific failure modes, not Likert scales, and validate each judge against human labels using a confusion matrix before deployment (see the second sketch below). These judges monitor production traces continuously, catching issues like inappropriate handoffs or hallucinated features that code-based tests cannot detect.
  • Strategic Eval Investment: Write 4-7 LLM judge evaluators for persistent, ambiguous failure modes that resist prompt fixes, and skip evals for obvious engineering errors. The highest-ROI activity is looking at actual user interaction data, which most teams neglect by jumping straight to hypothetical test cases without grounding them in real problems.
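
The error-analysis and axial-coding steps above lend themselves to a short pipeline sketch. The one below is a minimal Python illustration, not code from the episode: the trace store, the sample notes, the failure-mode names, and the llm_label placeholder for an LLM categorization call are all hypothetical, and in practice the open codes come from a human reading real production traces.

```python
import random

import pandas as pd

# Hypothetical trace store; in practice these come from your logging
# or observability tool.
traces = [{"id": i, "transcript": "..."} for i in range(2_000)]

# Step 1, error analysis: sample ~100 traces and open-code them by hand,
# one freeform note per trace describing the FIRST upstream error.
sample = random.sample(traces, k=100)  # the traces a human actually reads
open_codes = [
    {"trace_id": 7, "note": "said goodbye instead of offering alternatives"},
    {"trace_id": 19, "note": "claimed a gym the property does not have"},
    {"trace_id": 31, "note": "ended chat without capturing the lead's contact info"},
]  # ...one row per reviewed trace, written during manual review

# Step 2, axial coding: have an LLM map each freeform note to a single
# failure-mode label. llm_label is a placeholder for your model call.
def llm_label(note: str) -> str:
    # e.g. a chat-completion call whose prompt lists your candidate categories
    return "missed_handoff" if "goodbye" in note or "lead" in note else "hallucinated_feature"

# Step 3, the pivot: count labels to rank failure modes by prevalence.
df = pd.DataFrame(open_codes)
df["failure_mode"] = df["note"].map(llm_label)
print(df["failure_mode"].value_counts())
```

The pivot step is deliberately simple: once every note carries one failure-mode label, a plain count is enough to decide what to fix first.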

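The judge-validation step can be sketched the same way. Again a minimal illustration with made-up labels, assuming only pandas; substitute your own held-out traces and judge outputs.

```python
import pandas as pd

# Made-up validation set: human pass/fail labels on held-out traces,
# plus the LLM judge's verdict on the same traces (True = pass).
labels = pd.DataFrame({
    "human": [True, True, False, False, True, False, True, False, True, True],
    "judge": [True, False, False, False, True, True, True, False, True, True],
})

# Confusion matrix: rows are human ground truth, columns the judge.
print(pd.crosstab(labels["human"], labels["judge"],
                  rownames=["human"], colnames=["judge"]))

# Per-class agreement tells you whether the judge is reliable enough
# to monitor production traces continuously.
tpr = (labels.human & labels.judge).sum() / labels.human.sum()
tnr = (~labels.human & ~labels.judge).sum() / (~labels.human).sum()
print(f"catches real passes: {tpr:.0%}, catches real failures: {tnr:.0%}")
```

If either rate is low, the judge prompt gets refined and re-validated before the judge is trusted to run over production traffic.
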
What It Covers

Hamel Husain and Shreya Shankar, creators of the top-rated Maven evals course, demonstrate systematic methods for building AI product evaluations, using error analysis, open coding, and LLM judges to measure and improve application quality beyond vibes-based testing.

Notable Moment

The demonstration showed a real estate AI assistant telling a prospective tenant that no apartments with studies were available, then saying thanks and goodbye, completely missing the lead-nurturing opportunity. This failure only became visible through systematic trace review, not through vibes or generic benchmarks.
