Sonnet 5 review: I ran 64 generations to find out if it's worth it
Episode: 25 min
Read time: 2 min
Topics: Investing, Fundraising & VC, Design & UX
AI-Generated Summary
Key Takeaways
- ✓ Benchmark design: Build repeatable AI evals using frozen inputs, blind scoring, and a structured rubric rather than one-off vibe checks. Claude Code can scan past session history stored on your desktop to suggest relevant benchmark tasks tailored to your actual workflows, making setup faster and more personalized than starting from scratch.
- ✓ Sonnet 5 pricing window: Sonnet 5 launches at $2 per million input tokens and $10 per million output tokens, with Anthropic confirming prices rise after summer 2025. For teams running high-volume agentic workloads, testing and locking in usage now captures near-Opus performance at a significant cost discount before the pricing structure changes.
- ✓ Model-by-task routing: No single frontier model wins across all tasks. GPT-5.5 produces the most comprehensive PRDs, Sonnet 4.6 performs best for UI prototyping and conversational agents, and Opus 4.8 handles dense, complex UI generation. Routing prompts to the right model by task type outperforms defaulting to one model for everything.
- ✓ Human vs. LLM judgment gap: When the host's 70% human-weighted scores were combined with 30% automated LLM scores, rankings flipped significantly from pure LLM evaluation. LLM judges cluster scores near the middle of the scale and miss visual taste signals, making human review essential for design and writing quality assessments.
- ✓ Agentic benchmark saturation: Standard multi-step coding tasks no longer differentiate frontier models because GPT-5.5, Gemini 2.5 Pro, Opus 4.8, and Sonnet 5 all score similarly. Effective agentic evals need harder, more specialized tasks. Retiring saturated benchmarks and replacing them with higher-difficulty challenges is necessary to surface meaningful capability differences between models.
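To make the pricing takeaway concrete, here is a minimal sketch of the cost arithmetic at the quoted launch prices ($2 per million input tokens, $10 per million output tokens). The function name `run_cost` and the monthly token volumes are hypothetical, chosen only for illustration; they are not figures from the episode.

```python
# Launch prices quoted in the summary, in USD per million tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 10.00


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the quoted launch prices."""
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical monthly agentic workload: 500M input tokens, 50M output tokens.
monthly = run_cost(500_000_000, 50_000_000)
print(f"${monthly:,.2f}")  # $1,500.00
```

Swapping in your own token counts gives a baseline to compare against once the pricing structure changes.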
What It Covers
The host introduces the "How I AI Bench," a repeatable evaluation framework, and uses it to test Claude Sonnet 5 against GPT-5.5, Gemini 2.5 Pro, Opus 4.8, and Sonnet 4.6. The benchmark spans 64 generations across PRD writing, UI prototyping, agentic coding, and voice personality tasks, and the resulting model rankings are surprising.
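One way to picture the blind-scoring part of a framework like this is sketched below. It is not the host's actual harness: the `Generation` class, the `blind` and `unblind` helpers, the `sample-NN` labels, and the placeholder model names are all assumptions made for illustration. The idea is that the reviewer scores anonymized outputs, and the model names are revealed only after scoring.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Generation:
    model: str   # hidden from the reviewer until scoring is finished
    task: str
    output: str


def blind(generations, seed=0):
    """Shuffle generations and replace model names with anonymous labels.

    Returns (items, key): items is what the reviewer sees, and key maps
    each anonymous label back to the model that produced it.
    """
    rng = random.Random(seed)  # fixed seed keeps the ordering repeatable
    shuffled = list(generations)
    rng.shuffle(shuffled)
    items, key = [], {}
    for i, gen in enumerate(shuffled, start=1):
        label = f"sample-{i:02d}"
        items.append({"label": label, "task": gen.task, "output": gen.output})
        key[label] = gen.model
    return items, key


def unblind(scores, key):
    """Group {label: score} by model once the review is complete."""
    by_model = {}
    for label, score in scores.items():
        by_model.setdefault(key[label], []).append(score)
    return by_model


# Placeholder data: two hypothetical models answering the same frozen prompt.
gens = [
    Generation("model-a", "prd", "Draft PRD from model A"),
    Generation("model-b", "prd", "Draft PRD from model B"),
]
items, key = blind(gens, seed=1)
scores = {item["label"]: 3 for item in items}  # reviewer's rubric scores
print(unblind(scores, key))
```

Keeping the prompts frozen and the seed fixed means the same review sheet can be regenerated when a new model is added to the comparison.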
Notable Moment
The live leaderboard reveal produced an unexpected result: Gemini 2.5 Pro, a model the host had nearly forgotten was included in the test, tied for first place on the automated scoring, while the newly released Sonnet 5 landed at the bottom of the host's personal preference ranking.
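The gap between the automated leaderboard and the host's own preferences is easier to see with the 70/30 weighting written out. The sketch below blends human and LLM scores with those weights; the model names and all scores are hypothetical placeholders on an assumed 1 to 5 scale, not results from the episode.

```python
HUMAN_WEIGHT = 0.7  # weighting described in the summary
LLM_WEIGHT = 0.3


def blended(human: float, llm: float) -> float:
    """Combine a human score and an LLM-judge score with 70/30 weights."""
    return HUMAN_WEIGHT * human + LLM_WEIGHT * llm


def rank(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores: the LLM judge clusters near the middle of the scale,
# while the human reviewer spreads scores more widely.
llm_scores = {"model-a": 3.4, "model-b": 3.2, "model-c": 3.3}
human_scores = {"model-a": 2.0, "model-b": 4.5, "model-c": 3.5}

combined = {m: blended(human_scores[m], llm_scores[m]) for m in llm_scores}

print(rank(llm_scores))  # ['model-a', 'model-c', 'model-b']
print(rank(combined))    # ['model-b', 'model-c', 'model-a']
```

Because the LLM scores sit in a narrow band, even a modest spread in human scores is enough to reorder the leaderboard once the weights are applied.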
Books, tools, and gear mentioned in this episode
Tools
- Claude Code, by Anthropic (recommended)