
George Cameron

2 episodes
1 podcast

We have 2 summarized appearances for George Cameron so far. Browse all podcasts to discover more episodes.

Featured On 1 Podcast

All Appearances

2 episodes

AI Summary

→ WHAT IT COVERS

George Cameron and Micah Hill-Smith from Artificial Analysis explain how their independent LLM evaluation service benchmarks AI models across intelligence, speed, cost, and openness. They detail their business model for serving enterprise clients, their methodology for preventing manipulation, new evaluation frameworks including GDP-Val for agentic tasks, and trends showing the cost of intelligence dropping 100-1000x while total spending rises through reasoning models and multi-turn workflows.

→ KEY INSIGHTS

- **Independent Benchmarking Infrastructure:** Artificial Analysis runs evaluations through mystery-shopper accounts on public endpoints, verifying that labs cannot manipulate results via special access. They repeat evaluations multiple times to achieve 95% confidence intervals (adding significant cost), use standardized prompting across all models, and keep their free public benchmarks fully separate from their paid enterprise services to preserve independence and credibility.
- **Intelligence Cost Paradox:** The cost of GPT-4-level intelligence has dropped 100-1000x since launch (with Amazon's Nova models at the extreme), yet total AI spending keeps rising. Frontier models now use 10x more tokens through reasoning chains, developers deploy them in multi-turn agentic workflows that consume massive input-token volumes, and applications demand more intelligence than GPT-4 provided, so demand outpaces efficiency gains.
- **Hallucination Measurement Framework:** The Omniscience Index scores models from -100 to +100, penalizing incorrect answers so that answering "I don't know" beats guessing wrong (a scoring sketch follows this list). Claude models show the lowest hallucination rates despite not having the highest intelligence scores, revealing no strong correlation between general intelligence and hallucination tendency. The metric uses 90% held-out test sets of factual-knowledge questions to prevent data contamination.
- **GDP-Val Agentic Evaluation:** Artificial Analysis built an open-source agentic harness called Stirrup that outperforms the labs' official chatbots on GDP-Val tasks by 10-20 percentage points. The harness provides minimal tools (code execution, web search, file-system access) and lets models run for up to 100 turns. Gemini 3.0 Pro serves as an LLM judge comparing document outputs, achieving high alignment with human preferences by separating task execution from evaluation methodology.
- **Hardware Efficiency Reality:** Single-GPU throughput gains from the Hopper to the Blackwell generation exceed the commonly cited 2-3x, especially for large sparse models at realistic serving speeds. The throughput-per-GPU versus per-user-speed tradeoff means faster serving costs more. Total parameter count, not active parameters, correlates most strongly with knowledge retention; current sparse models run with only 3-5% of parameters active, leaving significant room for continued scaling through sparsity.
- **Token Efficiency Evolution:** Models increasingly spend extra tokens only on difficult questions, with the correlation between token usage and question difficulty improving throughout 2024. In telecommunications applications, GPT-5 costs less per resolved query than smaller open-source models despite higher per-token pricing, because it reaches solutions in fewer turns; turns-to-completion emerges as a critical efficiency metric alongside per-token cost (a worked cost example follows this summary).
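The scoring rule behind the Omniscience Index can be made concrete. Below is a minimal sketch assuming correct answers score +1, incorrect answers -1, and abstentions 0, with the average rescaled to the -100 to +100 range; the weights and the function name are illustrative, not Artificial Analysis's published formula.

```python
def omniscience_index(answers: list[str]) -> float:
    """Score a list of graded answers on a -100..+100 scale.

    Each element is one of "correct", "incorrect", or "abstain".
    Assumed weights (illustrative, not the published formula):
    correct = +1, incorrect = -1, abstain ("I don't know") = 0,
    so guessing wrong is strictly worse than admitting uncertainty.
    """
    if not answers:
        raise ValueError("need at least one graded answer")
    weights = {"correct": 1, "incorrect": -1, "abstain": 0}
    return 100 * sum(weights[a] for a in answers) / len(answers)

# A model that abstains on hard questions outscores one that guesses:
guesser   = ["correct"] * 60 + ["incorrect"] * 40
abstainer = ["correct"] * 60 + ["abstain"] * 40
print(omniscience_index(guesser), omniscience_index(abstainer))  # 20.0 60.0
```

Under such a rule, a model that abstains on questions it would otherwise get wrong strictly outscores one that guesses, which is exactly the behavior the index is described as rewarding.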
→ NOTABLE MOMENT

The founders discovered DeepSeek V3's frontier capabilities on Boxing Day 2024 while running evaluations during a family Christmas in New Zealand. The model scored close to OpenAI's leading position before the world noticed DeepSeek's breakthrough, showing how independent benchmarking can surface major capability shifts before they become widely recognized, as happened weeks later with the release of DeepSeek's reasoning models.

💼 SPONSORS
None detected

🏷️ LLM Benchmarking, AI Evaluation, Model Intelligence, Agentic AI, Reasoning Models, AI Cost Optimization
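The turns-to-completion point above is easiest to see with arithmetic. The sketch below uses hypothetical prices and turn counts, not figures from the episode, to show how a model with a higher per-token price can still cost less per resolved query.

```python
# Hypothetical illustration of cost per *resolved* query: the prices
# and turn counts below are made up for the example, not from the episode.

def cost_per_resolution(price_per_mtok: float, tokens_per_turn: int,
                        turns_to_resolve: float) -> float:
    """Total token cost (USD) to drive one query to resolution."""
    return price_per_mtok * tokens_per_turn * turns_to_resolve / 1e6

# A pricier frontier model that resolves the query in 2 turns ...
frontier = cost_per_resolution(price_per_mtok=10.0,
                               tokens_per_turn=4_000, turns_to_resolve=2)
# ... versus a cheaper small model that needs 12 turns to get there.
small = cost_per_resolution(price_per_mtok=2.0,
                            tokens_per_turn=4_000, turns_to_resolve=12)

print(f"frontier: ${frontier:.3f}/query, small: ${small:.3f}/query")
# frontier: $0.080/query, small: $0.096/query -> fewer turns wins
```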

AI Summary

→ WHAT IT COVERS

George Cameron and Micah Hill-Smith explain how Artificial Analysis became the independent benchmarking standard for AI models, covering their methodology for measuring intelligence, speed, cost, hallucination rates, and openness across hundreds of models and providers.

→ KEY INSIGHTS

- **Independent Benchmarking Economics:** Artificial Analysis runs evaluations costing hundreds to thousands of dollars monthly, using a mystery-shopper policy with unidentified accounts so labs cannot optimize specific endpoints for the benchmark. They preserve independence by never accepting payment for better rankings, monetizing instead through enterprise subscriptions and private benchmarking services.
- **Intelligence Cost Deflation:** GPT-4-level intelligence now costs 100-1000x less than at launch, yet total AI spending rises at the same time. The paradox arises because frontier models use 10x more tokens through reasoning chains and agentic workflows, producing a smile curve in which both cheap commodity intelligence and expensive frontier capability grow.
- **Hallucination Measurement Innovation:** The Omniscience Index scores models from -100 to +100, deducting points for incorrect answers rather than rewarding guesses. Claude models show the lowest hallucination rates at 15-20%, while intelligence level shows no correlation with hallucination tendency, revealing differences in post-training recipes between labs.
- **Agentic Benchmark Methodology:** GDP-Val AA uses 220 sub-tasks across 44 white-collar job scenarios, running models through the open-source Stirrup harness with code execution, web search, and context management (a minimal harness-loop sketch follows this summary). Models in the custom harness outperform their official chatbot versions, with Gemini 3 Pro serving as judge; reported scores carry 95% confidence intervals, which requires multiple evaluation runs.
- **Hardware Efficiency Reality:** Blackwell-generation GPUs deliver 2-3x throughput gains over Hopper for most workloads, not the marketed 4x, with actual improvements varying by model sparsity. Total parameter count correlates more strongly with knowledge retention than active parameters, suggesting sparse models like Kimi K2, at 3% activation, still benefit from larger total sizes.

→ NOTABLE MOMENT

The team ran DeepSeek V3 evaluations on Boxing Day 2024 in New Zealand and immediately recognized it as a breakthrough, weeks before the world noticed with R1. Their early detection came from systematically tracking global players beyond the mainstream spotlight.

💼 SPONSORS
None detected

🏷️ LLM Benchmarking, Model Evaluation, AI Infrastructure, Reasoning Models, Agentic AI
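Stirrup itself is open source, but its code is not walked through in the episode; the sketch below is only an illustration of the harness pattern both summaries describe: hand the model a few generic tools and loop until it returns a final answer or hits the turn cap. `call_model` and `run_tool` are hypothetical placeholders, not Stirrup's actual API.

```python
# Minimal sketch of an agentic-harness loop in the style described above:
# generic tools plus a hard turn cap. `call_model` and `run_tool` are
# hypothetical placeholders, not Stirrup's actual interface.

MAX_TURNS = 100
TOOLS = ["execute_code", "web_search", "read_file", "write_file"]

def run_task(task: str, call_model, run_tool) -> str | None:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        reply = call_model(messages, tools=TOOLS)
        if reply.get("tool") is None:       # no tool requested: model is done
            return reply["content"]         # final answer or artifact
        # Execute the requested tool and feed the result back to the model.
        result = run_tool(reply["tool"], reply.get("args", {}))
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": result})
    return None  # hit the turn cap without producing a final answer
```

The design choice worth noting is the one the summary highlights: the harness stays deliberately minimal, so differences in scores reflect the model's ability to drive the tools, not harness-specific scaffolding.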
