What are the key takeaways from this How I AI episode?

Key insights include: **Benchmark design:** Build repeatable AI evals using frozen inputs, blind scoring, and a structured rubric rather than one-off vibe checks. Claude Code can scan past session history stored on your desktop to suggest benchmark tasks tailored to your actual workflows, which makes setup faster and more personalized than starting from scratch. **Sonnet 5 pricing window:** Sonnet 5 launches at $2 per million input tokens and $10 per million output tokens, with Anthropic confirming that prices will rise after the summer. For teams running high-volume agentic workloads, testing and adopting it now captures near-Opus performance at a significant discount before the pricing structure changes. **Model-by-task routing:** No single frontier model wins across all tasks. GPT-5.5 produces the most comprehensive PRDs, Sonnet 4.6 performs best for UI prototyping and conversational agents, and Opus 4.8 handles dense, complex UI generation. Routing prompts to the right model by task type outperforms defaulting to one model for everything.

How long is this episode of How I AI?

This episode is 25 minutes long. SignalCast provides an AI-generated summary so you can get the key insights in about 3 minutes.

How I AI

Sonnet 5 review: I ran 64 generations to find out if it's worth it

June 30, 2026

25 min episode · 2 min read

Topics

Investing, Fundraising & VC, Design & UX

AI-Generated Summary

Published Jul 1, 2026

Key Takeaways

✓ Benchmark design: Build repeatable AI evals using frozen inputs, blind scoring, and a structured rubric rather than one-off vibe checks. Claude Code can scan past session history stored on your desktop to suggest benchmark tasks tailored to your actual workflows, which makes setup faster and more personalized than starting from scratch.
✓ Sonnet 5 pricing window: Sonnet 5 launches at $2 per million input tokens and $10 per million output tokens, with Anthropic confirming that prices will rise after the summer. For teams running high-volume agentic workloads, testing and adopting it now captures near-Opus performance at a significant discount before the pricing structure changes.
✓ Model-by-task routing: No single frontier model wins across all tasks. GPT-5.5 produces the most comprehensive PRDs, Sonnet 4.6 performs best for UI prototyping and conversational agents, and Opus 4.8 handles dense, complex UI generation. Routing prompts to the right model by task type outperforms defaulting to one model for everything.
✓ Human vs. LLM judgment gap: When the host's human scores (weighted 70%) were blended with automated LLM scores (weighted 30%), the rankings changed significantly compared with LLM-only evaluation. LLM judges cluster scores near the middle of the scale and miss visual taste signals, so human review is essential for assessing design and writing quality.
✓ Agentic benchmark saturation: Standard multi-step coding tasks no longer differentiate frontier models because GPT-5.5, Gemini 2.5 Pro, Opus 4.8, and Sonnet 5 all score similarly. Effective agentic evals need harder, more specialized tasks, so saturated benchmarks should be retired and replaced with higher-difficulty challenges that surface meaningful capability differences between models.
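
To make the pricing takeaway concrete, here is a minimal sketch of the cost arithmetic at the quoted launch prices. Only the two prices come from the summary. The function name and the token counts per generation are assumptions chosen for illustration, not figures from the episode.

```python
# Launch prices quoted in the summary, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 2.00
OUTPUT_PRICE_PER_MTOK = 10.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a workload at the quoted launch prices."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


# Hypothetical workload: 64 generations, each assumed to use
# 5,000 input tokens and 2,000 output tokens.
total = estimate_cost(64 * 5_000, 64 * 2_000)
print(f"${total:.2f}")
```

Because output tokens cost five times as much as input tokens at these prices, output-heavy agentic workloads are the ones most exposed to any later price change.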

What It Covers

The host introduces the "How I AI Bench," a repeatable evaluation framework, and uses it to test Claude Sonnet 5 against GPT-5.5, Gemini 2.5 Pro, Opus 4.8, and Sonnet 4.6. The test covers 64 generations across PRD writing, UI prototyping, agentic coding, and voice personality tasks, and the resulting model rankings are surprising.
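
One way to picture the "frozen inputs, blind scoring" idea is sketched below. This is not the host's actual harness: the prompt text, model names, and the blind_labels helper are all hypothetical. The sketch only shows how outputs can be shuffled behind anonymous labels so a reviewer scores them without knowing which model wrote which.

```python
import random

# Frozen input: the same prompt text is reused verbatim for every model and every run.
FROZEN_PROMPT = "Write a one-page PRD for a habit-tracking app."  # hypothetical task


def blind_labels(outputs: dict[str, str], seed: int) -> tuple[dict[str, str], dict[str, str]]:
    """Hide model identities behind shuffled labels.

    Returns (label -> output text, label -> model name). The second mapping
    is the answer key and should stay hidden until scoring is finished.
    """
    models = sorted(outputs)       # stable starting order
    rng = random.Random(seed)      # seeded so the blinding is repeatable
    rng.shuffle(models)
    labels = [f"Sample {chr(ord('A') + i)}" for i in range(len(models))]
    blinded = {label: outputs[model] for label, model in zip(labels, models)}
    key = {label: model for label, model in zip(labels, models)}
    return blinded, key


# Placeholder outputs standing in for real generations from FROZEN_PROMPT.
outputs = {
    "model_a": "PRD draft from model A",
    "model_b": "PRD draft from model B",
    "model_c": "PRD draft from model C",
}
blinded, key = blind_labels(outputs, seed=42)
for label in blinded:
    print(label)  # the reviewer sees only the labels and the text
```

Keeping the seed and the prompt fixed means a later rerun, for example when a new model is added, can be compared against earlier results on the same footing.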


Notable Moment

The live leaderboard reveal produced an unexpected result: Gemini 2.5 Pro, a model the host had nearly forgotten was included in the test, tied for first place on the automated scoring, while the newly released Sonnet 5 landed at the bottom of the host's personal preference ranking.
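
The gap between automated and human rankings can be illustrated with the 70/30 weighting mentioned in the takeaways. The sketch below is an assumption about how such a blend might be computed. The scores and model names are invented placeholders on a 1 to 10 scale, not results from the episode.

```python
HUMAN_WEIGHT = 0.7  # weight on the host's human scores, per the summary
LLM_WEIGHT = 0.3    # weight on the automated LLM-judge scores, per the summary


def blended_score(human: float, llm: float) -> float:
    """Combine a human score and an LLM-judge score with the 70/30 weighting."""
    return HUMAN_WEIGHT * human + LLM_WEIGHT * llm


def rank(scores: dict[str, float]) -> list[str]:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda model: scores[model], reverse=True)


# Invented placeholder scores: the LLM judge clusters near the middle,
# while the human reviewer spreads scores out.
llm_scores = {"model_a": 6.2, "model_b": 6.0, "model_c": 5.9}
human_scores = {"model_a": 4.0, "model_b": 7.0, "model_c": 8.5}

blended = {m: blended_score(human_scores[m], llm_scores[m]) for m in llm_scores}

print(rank(llm_scores))  # ordering by the LLM judge alone
print(rank(blended))     # ordering after the 70/30 blend
```

With these placeholder numbers the LLM-only order is reversed after blending, because the tightly clustered LLM scores contribute little separation while the human scores dominate.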

Books, tools, and gear mentioned in this episode

SignalCast may earn commission on purchases via these links.

Tools

Claude Code (Recommended)
by Anthropic
“Claude Code can scan past session history stored on your desktop to suggest relevant benchmark tasks tailored to your actual workflows, making setup faster and more personalized than starting from scratch.”
Runway (episode sponsor)
by Runway
https://runwayml.com/howiai
HyperAgent (episode sponsor)
by HyperAgent
https://hyperagent.com/howiai
