
AI Summary
→ WHAT IT COVERS

George Cameron and Micah Hill-Smith of Artificial Analysis explain how their independent LLM evaluation service benchmarks AI models across intelligence, speed, cost, and openness. They detail their business model serving enterprise clients, their methodology for preventing manipulation, new evaluation frameworks including GDPval for agentic tasks, and trends showing the cost of intelligence dropping 100-1000x even as total spending rises through reasoning models and multi-turn workflows.

→ KEY INSIGHTS

- **Independent Benchmarking Infrastructure:** Artificial Analysis runs evaluations through mystery-shopper accounts on public endpoints, so labs cannot manipulate results by granting special access. They repeat evaluations multiple times to achieve 95% confidence intervals (at significant extra cost), use standardized prompting across all models, and keep their free public benchmarks fully separate from their paid enterprise services to preserve independence and credibility in the market.

- **Intelligence Cost Paradox:** The cost of GPT-4-level intelligence has dropped 100-1000x since launch (with Amazon's Nova models representing the extreme), yet total AI spending keeps climbing. Frontier models now use roughly 10x more tokens through reasoning chains, developers deploy them in multi-turn agentic workflows that consume massive input-token volumes, and applications demand intelligence well beyond what GPT-4 provided, creating insatiable demand despite the efficiency gains.

- **Hallucination Measurement Framework:** The Omniscience Index scores models from −100 to +100, subtracting points for incorrect answers while rewarding "I don't know" responses. Claude models show the lowest hallucination rates despite not having the highest intelligence scores, revealing no strong correlation between general intelligence and hallucination tendency.
The metric is computed on factual-knowledge questions with a 90% held-out test set to prevent data contamination.

- **GDPval Agentic Evaluation:** Artificial Analysis built an open-source agentic harness called Stirrup that outperforms the labs' official chatbots on GDPval tasks by 10-20 percentage points. The harness provides minimal tools (code execution, web search, file-system access) and lets models run for up to 100 turns. They use Gemini 3.0 Pro as an LLM judge to compare document outputs, achieving high alignment with human preferences by separating task execution from evaluation methodology.

- **Hardware Efficiency Reality:** Single-GPU throughput gains from the Hopper to the Blackwell generation exceed the commonly cited 2-3x improvement, especially for large sparse models at realistic serving speeds. The tradeoff between throughput per GPU and per-user speed means faster serving costs more. Total parameter count (not active parameters) correlates most strongly with knowledge retention, and current sparse models activate only 3-5% of their parameters, suggesting significant room for continued scaling through sparsity.

- **Token Efficiency Evolution:** Models increasingly spend extra tokens only when a question is difficult, with the correlation between token usage and question difficulty improving throughout 2024. In telecommunications applications, GPT-5 costs less per resolved query than smaller open-source models despite higher per-token pricing, because it reaches a solution in fewer turns. Turns-to-completion is emerging as a critical efficiency metric alongside per-token cost for real-world deployment economics.

→ NOTABLE MOMENT

The founders revealed that they discovered DeepSeek V3's frontier capabilities on Boxing Day 2024 while running evaluations during family Christmas in New Zealand.
The model scored close to OpenAI's leading position well before DeepSeek's breakthrough became widely recognized with the later release of its reasoning models, demonstrating how independent benchmarking can surface major capability shifts early.

💼 SPONSORS

None detected

🏷️ LLM Benchmarking, AI Evaluation, Model Intelligence, Agentic AI, Reasoning Models, AI Cost Optimization
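The Omniscience Index scoring rule described under Key Insights can be sketched in a few lines. Artificial Analysis's exact per-question weights are not stated in the episode, so the point values below (+1 correct, −1 incorrect, 0 for "I don't know", rescaled to ±100) are illustrative assumptions, not the official formula.

```python
# Illustrative sketch of an Omniscience-Index-style score: incorrect answers
# are penalized, abstaining ("I don't know") is neutral, and the total is
# rescaled to the range [-100, 100]. Point values are assumptions.

def omniscience_score(results: list[str]) -> float:
    """results: one of 'correct', 'incorrect', or 'abstain' per question."""
    points = {"correct": 1, "incorrect": -1, "abstain": 0}
    raw = sum(points[r] for r in results)
    return 100 * raw / len(results)

# A model that always answers but often guesses wrong can score below one
# that answers less often but admits uncertainty:
guesser = ["correct"] * 60 + ["incorrect"] * 40                  # never abstains
hedger = ["correct"] * 50 + ["abstain"] * 45 + ["incorrect"] * 5

print(omniscience_score(guesser))  # 20.0
print(omniscience_score(hedger))   # 45.0
```

This asymmetry is why a model with lower raw accuracy (the hedger resolves only 50% of questions) can still rank higher on hallucination: the penalty for confident wrong answers outweighs the credit lost by abstaining.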