How I AI

Sonnet 5 review: I ran 64 generations to find out if it's worth it

25 min episode · 2 min read

Topics

Investing, Fundraising & VC, Design & UX

AI-Generated Summary

Key Takeaways

  • Benchmark design: Build repeatable AI evals using frozen inputs, blind scoring, and a structured rubric rather than one-off vibe checks. Claude Code can scan past session history stored on your desktop to suggest relevant benchmark tasks tailored to your actual workflows, making setup faster and more personalized than starting from scratch.
  • Sonnet 5 pricing window: Sonnet 5 launches at $2 per million input tokens and $10 per million output tokens, and Anthropic has confirmed that prices will rise after summer 2025. For teams running high-volume agentic workloads, testing and locking in usage now captures near-Opus performance at a significant discount before the pricing structure changes.
  • Model-by-task routing: No single frontier model wins across all tasks. GPT-5.5 produces the most comprehensive PRDs, Sonnet 4.6 performs best for UI prototyping and conversational agents, and Opus 4.8 handles dense, complex UI generation. Routing prompts to the right model by task type outperforms defaulting to one model for everything.
  • Human vs. LLM judgment gap: When the host blended human scores (weighted 70%) with automated LLM scores (weighted 30%), the rankings shifted significantly from those produced by pure LLM evaluation. LLM judges cluster scores near the middle of the scale and miss visual taste signals, making human review essential for design and writing quality assessments.
  • Agentic benchmark saturation: Standard multi-step coding tasks no longer differentiate frontier models because GPT-5.5, Gemini 2.5 Pro, Opus 4.8, and Sonnet 5 all score similarly. Effective agentic evals need harder, more specialized tasks. Retiring saturated benchmarks and replacing them with higher-difficulty challenges is necessary to surface meaningful capability differences between models.
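
To make the pricing takeaway concrete, here is a minimal sketch of the cost arithmetic at the launch prices quoted above ($2 per million input tokens, $10 per million output tokens). The function name and the example workload volumes are assumptions chosen for illustration; they are not figures from the episode.

```python
# Launch prices quoted in the summary, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 2.0
OUTPUT_PRICE_PER_MTOK = 10.0
TOKENS_PER_MTOK = 1_000_000


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the quoted per-million-token prices."""
    input_cost = input_tokens / TOKENS_PER_MTOK * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / TOKENS_PER_MTOK * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


# Hypothetical monthly agentic workload: 500M input tokens, 50M output tokens.
monthly_cost = estimate_cost(500_000_000, 50_000_000)
print(f"${monthly_cost:,.2f}")  # prints $1,500.00
```

Because output tokens cost five times as much as input tokens at these prices, output-heavy agentic workloads are the ones most sensitive to any later price change.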

What It Covers

The host introduces the "How I AI Bench," a repeatable evaluation framework that tests Claude Sonnet 5 against GPT-5.5, Gemini 2.5 Pro, Opus 4.8, and Sonnet 4.6. The benchmark covers 64 generations spanning PRD writing, UI prototyping, agentic coding, and voice personality tasks, and the resulting model rankings are surprising.
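
The blind-scoring idea behind a framework like this can be sketched as follows. This is an illustration under stated assumptions, not the host's actual harness: the rubric criteria and weights, the `Generation` type, and the `sample-N` labels are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical rubric (criterion -> weight); the episode's real rubric is not given here.
RUBRIC = {"completeness": 0.4, "clarity": 0.3, "taste": 0.3}


@dataclass(frozen=True)
class Generation:
    model: str   # hidden from the reviewer during scoring
    task: str    # the frozen input prompt, identical for every model
    output: str


def blind(generations, seed=0):
    """Shuffle outputs and hide model names behind neutral labels."""
    rng = random.Random(seed)  # fixed seed keeps the shuffle repeatable
    shuffled = list(generations)
    rng.shuffle(shuffled)
    return {f"sample-{i + 1}": g for i, g in enumerate(shuffled)}


def rubric_score(ratings):
    """Collapse per-criterion ratings (for example 1 to 5) into one weighted score."""
    return sum(RUBRIC[c] * ratings[c] for c in RUBRIC)


def unblind(key, scores_by_label):
    """Map blind labels back to model names once scoring is finished."""
    return {key[label].model: score for label, score in scores_by_label.items()}


# Usage: the reviewer sees only "sample-1", "sample-2", ... while scoring.
gens = [Generation("model-a", "Write a PRD", "..."),
        Generation("model-b", "Write a PRD", "...")]
key = blind(gens)
scores = {label: rubric_score({"completeness": 4, "clarity": 3, "taste": 5})
          for label in key}
results = unblind(key, scores)
```

Keeping the label-to-model key separate from the scoring step is what makes the review blind: model names are only reattached after every sample has a score.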

Notable Moment

The live leaderboard reveal produced an unexpected result: Gemini 2.5 Pro, a model the host had nearly forgotten was included in the test, tied for first place on the automated scoring, while the newly released Sonnet 5 landed at the bottom of the host's personal preference ranking.
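
A gap like this between automated scores and personal preference is easy to reproduce once human and LLM scores are blended. The sketch below uses the 70% human / 30% LLM weighting mentioned in the takeaways; the model names and all score values are invented for illustration and are not results from the episode.

```python
HUMAN_WEIGHT = 0.7  # weighting described in the summary
LLM_WEIGHT = 0.3


def blended(human: float, llm: float) -> float:
    """Combine a human score and an LLM-judge score into one number."""
    return HUMAN_WEIGHT * human + LLM_WEIGHT * llm


def rank(scores: dict) -> list:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores on a 10-point scale. The LLM judge clusters near the
# middle, while the human reviewer spreads scores much further apart.
llm_scores = {"model-a": 7.2, "model-b": 7.0, "model-c": 6.9}
human_scores = {"model-a": 5.0, "model-b": 6.0, "model-c": 9.0}

blended_scores = {m: blended(human_scores[m], llm_scores[m]) for m in llm_scores}

print(rank(llm_scores))      # ['model-a', 'model-b', 'model-c']
print(rank(blended_scores))  # ['model-c', 'model-b', 'model-a']
```

Because the LLM judge's scores sit within a narrow band, even a 30% weight contributes little separation, so the human scores dominate the final order.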

Books, tools, and gear mentioned in this episode

SignalCast may earn commission on purchases via these links.

Tools

  • Claude Code (Recommended), by Anthropic
  • Runway (sponsor): https://runwayml.com/howiai
  • HyperAgent (sponsor): https://hyperagent.com/howiai
