Is AI About to “Eat Everything”? | AI Reality Check

May 14, 2026

31 min episode · 2 min read

Topics

Artificial Intelligence

AI-Generated Summary

Key Takeaways

✓ METR Chart Interpretation: The chart measures only specific software tasks, not general AI capability. When Claude Opus 4.5 plots at 4 hours 53 minutes, it means one particular programming task took humans that long, and the model completes it 50% of the time. It does not mean AI can now perform any 5-hour human task. The success threshold matters: at 80% reliability, the best model handles only 3-hour tasks.
✓ Pre-training vs. Post-training Shift: From GPT-2 through 2024, AI companies scaled pre-training (more data, longer runs) until they hit a capability wall. The 2024 pivot to post-training, which applies reinforcement learning to narrow datasets with right-or-wrong answers such as whether code compiles, is what drove the programming benchmark jumps visible in the METR chart starting around late 2024 and accelerating through 2026.
✓ Coding Harnesses Drive the Leap: The exponential jump in programming benchmarks reflects not just better LLMs but 12–18 months of intensive development on coding harnesses like Claude Code and Cursor. These harnesses contain substantial hand-coded, expert-system-style logic (giant conditional statements, external tool integrations, verification loops) built by programmers who encoded their domain expertise directly into the scaffolding surrounding the LLM.
✓ River vs. Water Mental Model: Treat AI progress as exploring tributaries, not a rising water level. Software development proved a navigable tributary after two years of focused effort. Other applications, like AI email management, hit dead ends quickly. One tributary's depth reveals nothing about adjacent ones. Evaluate each AI application independently, based on its own tooling investment and domain fit, rather than extrapolating from programming benchmarks.
✓ Broader Capability Index Shows Linear Growth: The Epoch Capabilities Index, which measures AI performance across multiple domains rather than just programming, shows slow, steady, linear improvement over the same period in which METR's programming chart shows exponential gains. This supports the view that the programming jump is domain-specific and driven by targeted investment, not evidence of across-the-board intelligence acceleration.

What It Covers

Cal Newport decodes the METR AI time horizon chart, which tracks the longest software task (measured in human completion time) that LLM-plus-coding-harness combinations can complete at a 50% success rate. He explains why the recent exponential-looking jump reflects narrow programming tool development rather than a general acceleration in AI capability.
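
To make the threshold point concrete, here is a minimal Python sketch of how a "time horizon" can shift with the chosen success rate. It assumes success probability follows a logistic curve in the log of task length; that modeling choice, the function names, and the `INTERCEPT` and `SLOPE` values are illustrative assumptions, not figures from the episode or from METR's data.

```python
import math


def success_probability(minutes, intercept, slope):
    """Modeled chance of success on a task that takes humans `minutes`.

    Assumes a logistic curve in log2(task length): longer tasks are harder.
    """
    z = intercept - slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(threshold, intercept, slope):
    """Task length (in minutes) at which modeled success equals `threshold`."""
    logit = math.log(threshold / (1.0 - threshold))
    return 2.0 ** ((intercept - logit) / slope)


# Hypothetical curve parameters, chosen only for illustration.
INTERCEPT, SLOPE = 16.0, 2.0

h50 = time_horizon(0.5, INTERCEPT, SLOPE)
h80 = time_horizon(0.8, INTERCEPT, SLOPE)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

With any downward-sloping curve of this kind, the 80% horizon comes out shorter than the 50% horizon, which is why the same model can look like a "5-hour" system at one threshold and a "3-hour" system at another.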

Notable Moment

Newport reveals that Anthropic's Claude Code source code leaked because a model trained to detect security vulnerabilities had one itself. The leaked code exposed how much old-fashioned, hand-written expert-system logic powers the harness — undermining narratives that recent AI leaps stem purely from emergent model intelligence.
