Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah-Hill Smith
Episode · 78 min · 3 min read
Topics: Artificial Intelligence

AI-Generated Summary
Key Takeaways
- ✓Independent Benchmarking Infrastructure: Artificial Analysis runs evaluations using mystery shopper accounts on public endpoints to verify labs cannot manipulate results through special access. They repeat evaluations multiple times to achieve 95% confidence intervals (adding significant cost), use standardized prompting across all models, and maintain complete separation between their free public benchmarks and paid enterprise services to preserve independence and credibility in the market.
- ✓Intelligence Cost Paradox: The cost to achieve GPT-4 level intelligence has dropped 100-1000x since launch (with Amazon Nova models representing the extreme), yet total AI spending increases dramatically. This occurs because frontier models now use 10x more tokens through reasoning chains, developers deploy them in multi-turn agentic workflows consuming massive input tokens, and applications demand higher intelligence levels than GPT-4 provided, creating insatiable demand despite efficiency gains.
- ✓Hallucination Measurement Framework: The Omniscience Index scores models from -100 to +100, penalizing incorrect answers so that admitting "I don't know" beats guessing wrong. Claude models show the lowest hallucination rates despite not having the highest intelligence scores, revealing no strong correlation between general intelligence and hallucination tendency. The metric uses 90% held-out test sets of factual knowledge questions to prevent data contamination.
- ✓GDP-Val Agentic Evaluation: Artificial Analysis created an open-source agentic harness called Stirrup that outperforms official lab chatbots on GDP-Val tasks by 10-20 percentage points. The harness provides minimal tools (code execution, web search, file system access) and lets models run up to 100 turns. They use Gemini 3.0 Pro as an LLM judge to compare document outputs, achieving high human preference alignment by separating task execution from evaluation methodology.
- ✓Hardware Efficiency Reality: Single GPU throughput gains from Hopper to Blackwell generation exceed the commonly cited 2-3x improvement, especially for large sparse models at realistic serving speeds. The throughput-per-GPU versus per-user-speed tradeoff means faster serving costs more. Total parameter count (not active parameters) correlates most strongly with knowledge retention, with current sparse models running 3-5% active parameters, suggesting significant room for continued scaling through sparsity approaches.
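The Omniscience Index scoring described above can be sketched in a few lines. This is a minimal illustration, not Artificial Analysis's implementation: the exact point weighting is not given in the episode summary, so treating abstentions as neutral (worth zero) is an assumption here.

```python
def omniscience_score(results):
    """Score a run of factual questions on a -100 to +100 scale.

    results: list of outcomes, each 'correct', 'incorrect', or 'abstain'.
    Correct answers add a point, incorrect answers subtract a point, and
    "I don't know" abstentions are neutral (an assumed weighting), so the
    scaled score spans -100 (all wrong) to +100 (all right).
    """
    points = {"correct": 1, "incorrect": -1, "abstain": 0}
    return 100 * sum(points[r] for r in results) / len(results)

# Under this weighting, a model that abstains instead of guessing
# comes out ahead of one that guesses and is wrong half the time:
guesser = omniscience_score(["correct"] * 50 + ["incorrect"] * 50)  # 0.0
hedger = omniscience_score(["correct"] * 50 + ["abstain"] * 50)     # 50.0
```

The design choice this metric encodes is that a wrong answer is strictly worse than no answer, which is why a model can rank highly here without topping raw intelligence benchmarks.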
What It Covers
George Cameron and Micah-Hill Smith from Artificial Analysis explain how their independent LLM evaluation service benchmarks AI models across intelligence, speed, cost, and openness. They detail their business model serving enterprise clients, methodology for preventing manipulation, new evaluation frameworks including GDP-Val for agentic tasks, and trends showing intelligence costs dropping 100-1000x while total spending increases through reasoning models and multi-turn workflows.
Key Questions Answered
- •Token Efficiency Evolution: Models increasingly use more tokens only when needed for difficult questions, with correlation between token usage and question difficulty improving throughout 2024. In telecommunications applications, GPT-5 costs less per resolved query than smaller open-source models despite higher per-token pricing because it reaches solutions in fewer turns. Number of turns to completion emerges as a critical efficiency metric alongside per-token cost for real-world deployment economics.
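The deployment economics in the Token Efficiency point above reduce to simple arithmetic: cost to resolution is per-token price times tokens per turn times the number of turns. The prices and turn counts below are invented placeholders for illustration, not figures from the episode.

```python
def cost_per_resolved_query(price_per_mtok, tokens_per_turn, turns):
    """Dollar cost to fully resolve one query.

    price_per_mtok: blended price in dollars per million tokens.
    tokens_per_turn: average tokens consumed per agent turn.
    turns: turns needed to reach a resolution.
    """
    return price_per_mtok / 1_000_000 * tokens_per_turn * turns

# Hypothetical numbers: a pricier frontier model that resolves in 2 turns
# can undercut a cheaper model that needs 10 turns of context to get there.
frontier = cost_per_resolved_query(10.0, 5_000, turns=2)   # 0.10
small = cost_per_resolved_query(1.5, 8_000, turns=10)      # 0.12
```

This is why turns-to-completion matters alongside per-token price: multi-turn workflows multiply the token bill, so the model that finishes sooner can be the cheaper one overall.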
Notable Moment
The founders revealed they discovered DeepSeek V3's frontier capabilities on Boxing Day 2024 while running evaluations during family Christmas in New Zealand. The model scored close to OpenAI's frontier before DeepSeek's breakthrough became widely recognized with the later release of its reasoning models, demonstrating how independent benchmarking can surface major capability shifts early.