Latent Space

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

78 min episode · 2 min read · Topics: Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Independent Benchmarking Economics: Artificial Analysis runs evaluations that cost hundreds to thousands of dollars per month, enforcing a mystery-shopper policy of unidentified accounts so labs cannot optimize the specific endpoints being tested. They maintain independence by never accepting payment for better rankings, monetizing instead through enterprise subscriptions and private benchmarking services.
  • Intelligence Cost Deflation: GPT-4-level intelligence now costs 100-1000x less than at launch, yet total AI spending keeps rising. The paradox resolves because frontier models consume 10x more tokens through reasoning chains and agentic workflows, producing a "smile curve" in which both cheap commodity intelligence and expensive frontier capability grow (a cost sketch follows this list).
  • Hallucination Measurement Innovation: The Omniscience Index scores models from -100 to +100, deducting points for incorrect answers so that guessing is penalized rather than rewarded. Claude models show the lowest hallucination rates, at 15-20%, and intelligence level shows no correlation with hallucination tendency, pointing to differences in the labs' post-training recipes (a scoring sketch follows this list).
  • Agentic Benchmark Methodology: GDPval-AA uses 220 sub-tasks across 44 white-collar occupations, running models through Artificial Analysis's open-source Stirrup harness with code execution, web search, and context management. Models run in custom harnesses outperform their official chatbot versions, and scores for models such as Gemini 3 Pro are reported with 95% confidence intervals, which requires multiple evaluation runs (a confidence-interval sketch follows this list).
  • Hardware Efficiency Reality: Blackwell-generation GPUs deliver 2-3x throughput gains over Hopper for most workloads, not the marketed 4x, with the actual improvement varying by model sparsity. Total parameter count correlates more strongly with knowledge retention than active parameter count, suggesting that sparse models like Kimi K2, with roughly 3% of parameters active, still benefit from larger total size (a parameter-count sketch follows this list).
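
To make the cost-deflation arithmetic above concrete, here is a minimal sketch. All prices and token counts below are illustrative assumptions, not figures quoted in the episode:

```python
# Sketch of the cost-deflation paradox: price per token collapses, but
# reasoning and agentic workflows consume far more tokens, so spend per
# task can still grow. All numbers are made up for illustration.

def cost_per_task(price_per_m_tokens: float, tokens_per_task: int) -> float:
    """Dollar cost of one task at a given $/1M-token price."""
    return price_per_m_tokens * tokens_per_task / 1_000_000

# Hypothetical 2023 frontier model: expensive tokens, short answers.
legacy = cost_per_task(price_per_m_tokens=60.0, tokens_per_task=1_000)

# Hypothetical commodity model at the same intelligence: ~100x cheaper tokens.
commodity = cost_per_task(price_per_m_tokens=0.60, tokens_per_task=1_000)

# Hypothetical frontier reasoning agent: cheaper tokens, 10x+ the usage.
frontier = cost_per_task(price_per_m_tokens=10.0, tokens_per_task=50_000)

print(f"2023 frontier:    ${legacy:.4f}/task")     # $0.0600
print(f"commodity tier:   ${commodity:.4f}/task")  # $0.0006 -> 100x deflation
print(f"agentic frontier: ${frontier:.4f}/task")   # $0.5000 -> spend grows anyway
```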
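A minimal sketch of the Omniscience Index scoring as described above: a correct answer earns a point, an incorrect answer loses one, and an abstention scores zero, giving a range of -100 to +100. The function name and labels are mine, not Artificial Analysis's published implementation:

```python
def omniscience_index(results: list[str]) -> float:
    """Score graded answers labeled 'correct', 'incorrect', or 'abstain'.
    Wrong answers are penalized, so guessing scores worse than admitting
    uncertainty; the result ranges from -100 to +100."""
    correct = results.count("correct")
    incorrect = results.count("incorrect")
    return 100 * (correct - incorrect) / len(results)

# A model that guesses wrong on 30% of questions scores far below one
# that abstains on the same 30%.
guesser = ["correct"] * 70 + ["incorrect"] * 30
abstainer = ["correct"] * 70 + ["abstain"] * 30
print(omniscience_index(guesser))    # 40.0
print(omniscience_index(abstainer))  # 70.0
```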
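Reporting scores with 95% confidence intervals over repeated runs might look like the following. This is a standard normal-approximation interval over run-level scores, my assumption rather than the harness's actual statistics code:

```python
import statistics

def ci_95(run_scores: list[float]) -> tuple[float, float, float]:
    """Mean and 95% confidence bounds over repeated evaluation runs,
    using a normal approximation (mean +/- 1.96 * standard error)."""
    mean = statistics.mean(run_scores)
    stderr = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, mean - 1.96 * stderr, mean + 1.96 * stderr

# Five hypothetical runs of one model on the same benchmark:
mean, low, high = ci_95([61.2, 58.9, 60.4, 62.1, 59.7])
print(f"{mean:.1f} (95% CI {low:.1f}-{high:.1f})")  # 60.5 (95% CI 59.4-61.6)
```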
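Finally, the sparsity arithmetic behind the total-versus-active-parameter point, using Kimi K2's publicly stated sizes (roughly 1T total parameters, about 32B active per token); the helper itself is illustrative:

```python
def activation_ratio(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an MoE model's parameters activated per token."""
    return active_params_b / total_params_b

# Kimi K2's publicly stated sizes: ~1T total, ~32B active per token.
print(f"{activation_ratio(1000, 32):.1%} active")  # 3.2%, i.e. the ~3% above
```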

What It Covers

George Cameron and Micah Hill-Smith explain how Artificial Analysis became the independent benchmarking standard for AI models, covering their methodology for measuring intelligence, speed, cost, hallucination rates, and openness across hundreds of models and providers.


Notable Moment

The team revealed they ran DeepSeek V3 evaluations on Boxing Day 2024 in New Zealand, immediately recognizing it as a breakthrough weeks before the world took notice with R1. That early detection came from systematically tracking global labs outside the mainstream spotlight.
