Odd Lots

Understanding the Most Viral Chart in Artificial Intelligence

56 min episode · 2 min read

Topics

Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Time Horizon Methodology: METR measures AI capability by timing skilled human engineers on a set of tasks, then testing AI models on the same tasks. The "time horizon" is the task length at which AI succeeds 50% of the time; a sketch of the construction follows this list. Claude Opus 4.6 reaches 11 hours 59 minutes, slightly more than double GPT Codex's previous 5-hour-50-minute mark.
  • 50% Threshold vs. 80%: METR defaults to the 50% success threshold rather than 80% for statistical reasons: the 50% point requires the fewest samples to estimate and is least sensitive to scoring noise. The 80% chart shows the same doubling pace at roughly one-fifth the task length, so the 80% horizon should reach today's 50% horizon in approximately eight months.
  • Doubling Rate Revision: METR initially published a seven-month capability doubling time but revised it to four months after newer models consistently matched the faster trend. Compute investment has grown at essentially the same exponential rate as capability gains, and already-committed data center buildouts through 2027-2028 make a near-term slowdown unlikely regardless of other variables.
  • Benchmark vs. Real-World Gap: AI time horizon scores overestimate real-world productivity gains for several reasons: automated scoring does not capture holistic code quality standards, real tasks involve larger codebases and collaboration, and verifying AI-generated work takes extra time because the reviewer lacks the context they would have had writing it themselves. These frictions are real but not considered fundamental barriers to eventual productivity gains.
  • Chinese Model Gap: Chinese models including Qwen do not appear on METR's main time horizon charts because they trail US frontier models by an estimated nine to twelve months on task capability. METR also notes Chinese benchmark scores may overstate actual held-out task performance relative to US models, making the capability gap potentially larger than raw benchmark comparisons suggest.
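
To make the time-horizon construction concrete, here is a minimal sketch in Python. It assumes the success rate is fit with a logistic curve in the logarithm of human task length (which matches how METR describes its fit), with the p% horizon being the task length where the curve crosses p. The data points and function names are invented for illustration; none of the numbers below are METR's.

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(log2_minutes, log2_h50, slope):
        # Success probability as a function of log2(task length in minutes);
        # log2_h50 is the curve's midpoint, i.e. the log2 of the 50% horizon.
        return 1.0 / (1.0 + np.exp(slope * (log2_minutes - log2_h50)))

    # Invented (task length in minutes, success rate) pairs; illustrative only.
    task_minutes = np.array([1.0, 4.0, 15.0, 60.0, 240.0, 960.0])
    success_rate = np.array([0.98, 0.92, 0.75, 0.55, 0.30, 0.10])

    (log2_h50, slope), _ = curve_fit(
        logistic, np.log2(task_minutes), success_rate, p0=[6.0, 1.0]
    )

    def horizon_minutes(p):
        # Invert the fitted logistic: task length where success probability is p.
        return 2.0 ** (log2_h50 + np.log(1.0 / p - 1.0) / slope)

    print(f"50% horizon: {horizon_minutes(0.50):.0f} min")  # the headline number
    print(f"80% horizon: {horizon_minutes(0.80):.0f} min")  # shorter by construction

The 80% horizon sits at a shorter task length by construction, which is why METR's 80% chart tracks the same slope at a fraction of the task length. At a factor-of-five gap and a four-month doubling time, catching up takes about log2(5) × 4 ≈ 9 months, the same ballpark as the episode's eight-month figure.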

What It Covers

METR, a 30-person San Francisco nonprofit, created the most viral chart in AI: a "time horizon" graph that measures AI models by the length of engineering tasks they can complete, scaled by how long those tasks take humans. Claude Opus 4.6 now handles tasks requiring nearly 12 human hours at a 50% success rate, and that horizon has been doubling roughly every four months (see the projection sketch below).
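
The four-month doubling claim implies a simple exponential trend. Here is a back-of-the-envelope projection in Python using only the two figures quoted above, a roughly 12-hour horizon today doubling every four months; the 2^(t/T) form is a generic trend model, not METR's exact fit.

    # Both figures come from the summary above; the functional form is an assumption.
    doubling_months = 4.0
    horizon_hours_now = 12.0

    def projected_horizon_hours(months_ahead):
        # One doubling per `doubling_months` months if the stated trend holds.
        return horizon_hours_now * 2.0 ** (months_ahead / doubling_months)

    for m in (4, 8, 12, 24):
        print(f"+{m:>2} months: ~{projected_horizon_hours(m):.0f} hours")
    # Prints roughly: +4 -> 24, +8 -> 48, +12 -> 96, +24 -> 768 hours

Whether the trend persists that long is exactly what the episode debates; the projection only shows what a four-month doubling implies if it does.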

Notable Moment

When asked about fully autonomous AI-to-AI collaboration today, METR's Joel Becker described current systems as eventually "falling on their faces" without human idea generation: the human still provides the concept while the AI handles execution, meaning true autonomous research loops remain beyond present capability.
