He Co-Invented the Transformer. Now: Continuous Thought Machines - Llion Jones and Luke Darlow [Sakana AI]
Episode: 72 min · Read time: 3 min · Topics: Artificial Intelligence
AI-Generated Summary
Key Takeaways
- ✓Architecture Lock-in: Transformers dominate not because alternatives are worse, but because switching costs are prohibitive. Competing architectures must be "crushingly better" — not marginally better — to displace an established system with mature tooling, fine-tuning pipelines, and inference infrastructure. The same pattern occurred when transformers displaced RNNs: the accuracy jump was so large that researchers had no choice but to migrate.
- ✓Continuous Thought Machine Design: The CTM introduces three architectural novelties: an internal sequential "thought" dimension that applies compute across discrete steps, neuron-level models (NLMs) that treat each neuron as a small MLP processing a history of activations rather than a single ReLU, and synchronization representations that measure dot-product correlations between neuron activation time series to encode richer, temporally-aware state.
- ✓Native Adaptive Computation: Training the CTM on ImageNet with a dual-loss — minimizing cross-entropy at both the lowest-loss step and the highest-certainty step — causes easy examples to resolve in one or two steps while hard examples use the full 50-step budget. This adaptive behavior emerges without explicit computation-penalty terms, unlike Alex Graves' Adaptive Computation Time paper, which required carefully tuned auxiliary losses.
- ✓Calibration as Architecture Signal: After standard training, the CTM produced near-perfect probability calibration on classification tasks — meaning a 90% confidence prediction was correct roughly 90% of the time. Most neural networks trained to convergence become poorly calibrated and require post-hoc correction. The CTM's emergent calibration suggests the synchronization-based representation aligns model uncertainty with actual error rates more naturally.
- ✓SudokuBench Reasoning Gap: Sakana AI released SudokuBench, a dataset of handcrafted variant Sudoku puzzles, each with its own natural-language rule set, sourced from thousands of hours of Cracking the Cryptic YouTube videos that provide detailed human reasoning traces. Current top models solve only the simplest puzzles, at around 15% accuracy. GPT-4 shows improvement but cannot find the "break-in" insight each puzzle requires, exposing a fundamental gap in sequential deductive reasoning.
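The two less familiar ideas in the CTM design above, per-neuron MLPs over activation histories and synchronization as dot products between neuron time series, can be sketched in a few lines of NumPy. The sizes, the history-window update, and the weight initialization here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's): D neurons, an
# M-step activation history per neuron, T internal "thought" steps.
D, M, T = 8, 4, 16
H = 16  # hidden width of each neuron-level model

# Each neuron gets its own tiny MLP ("neuron-level model") that maps the
# last M pre-activations of that neuron to one post-activation,
# replacing a single pointwise ReLU.
W1 = 0.1 * rng.normal(size=(D, M, H))
W2 = 0.1 * rng.normal(size=(D, H))

def nlm_step(history):
    """history: (D, M) per-neuron histories -> (D,) post-activations."""
    hidden = np.maximum(0.0, np.einsum("dm,dmh->dh", history, W1))
    return np.einsum("dh,dh->d", hidden, W2)

# Unroll the internal thought dimension, recording each neuron's
# post-activation time series.
history = rng.normal(size=(D, M))
series = np.zeros((D, T))
for t in range(T):
    z = nlm_step(history)
    series[:, t] = z
    # Slide the history window: drop the oldest entry, append the new one
    # (a simplified stand-in for the model's actual recurrent update).
    history = np.concatenate([history[:, 1:], z[:, None]], axis=1)

# Synchronization representation: pairwise dot products between neuron
# activation time series, giving a temporally aware (D, D) state.
sync = series @ series.T
print(sync.shape)  # -> (8, 8)
```

The point of the sketch is the shape of the computation: the model's state is not a single activation vector but a matrix of how neuron trajectories co-vary over internal time.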
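The dual loss described in the adaptive-computation takeaway can also be sketched. The per-step logits are random stand-ins, and the entropy-based certainty measure is one plausible proxy rather than necessarily the paper's exact definition:

```python
import numpy as np

rng = np.random.default_rng(1)

T, C = 50, 10   # thought steps (the 50-step budget) and classes
target = 3      # true class index for this example

# Stand-in per-step logits, as if read out after each internal step.
logits = rng.normal(size=(T, C))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(logits)                 # (T, C)
ce = -np.log(probs[:, target] + 1e-12)  # cross-entropy at each step

# Certainty modeled as 1 - normalized entropy (an assumption here).
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
certainty = 1.0 - entropy / np.log(C)

t_low = int(np.argmin(ce))           # step where the loss is lowest
t_sure = int(np.argmax(certainty))   # step where the model is most certain

# Dual loss: average the cross-entropy at those two steps. Note there is
# no explicit computation-penalty term anywhere in this objective.
loss = 0.5 * (ce[t_low] + ce[t_sure])
print(loss >= 0.0)  # -> True
```

Because the objective only rewards being right early or being confidently right, rather than taxing extra steps, the early-exit behavior on easy inputs is emergent rather than engineered.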
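The calibration claim, that 90% confidence predictions are correct roughly 90% of the time, is conventionally measured with expected calibration error (ECE). A minimal sketch of that measurement, using synthetic predictions that are calibrated by construction rather than real CTM outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic predictions calibrated by construction: a prediction made
# with confidence c is correct with probability c. This is the property
# the CTM reportedly exhibits after standard training.
N = 20_000
confidence = rng.uniform(0.1, 1.0, size=N)
correct = rng.uniform(size=N) < confidence

def expected_calibration_error(conf, hit, n_bins=10):
    """Standard ECE: bin by confidence, weight each bin's
    |mean confidence - accuracy| gap by the bin's share of samples."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - hit[mask].mean())
    return ece

# A well-calibrated model scores near zero; a typical overconfident
# network trained to convergence scores noticeably higher.
print(expected_calibration_error(confidence, correct))
```

Post-hoc fixes like temperature scaling exist precisely because most converged networks fail this check; the claim in the episode is that the CTM passes it without any such correction.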
What It Covers
Llion Jones, co-inventor of the transformer, and Sakana AI researcher Luke Darlow discuss the Continuous Thought Machine (CTM), a spotlight paper at NeurIPS 2025. They examine why AI research is trapped in a transformer-centric local minimum, how biological neuron synchronization inspired a new recurrent architecture, and why research freedom produces better science than commercial pressure.
Key Questions Answered
- •Research Freedom as Competitive Strategy: Jones argues that protecting researcher autonomy is a primary leadership responsibility at Sakana AI. Commercial pressure — investor return expectations, product deadlines, publication quotas — systematically narrows the solution space researchers explore. The CTM itself emerged from eight months of unconstrained exploration with no predetermined goal, producing emergent behaviors like backtracking maze navigation and leapfrog path-solving under constrained compute budgets.
Notable Moment
During training, the CTM spontaneously developed two distinct maze-solving strategies depending on available compute steps. With sufficient time, it traced paths sequentially. When steps were constrained, it instead leapfrogged ahead, traced segments backward, then jumped forward again — an algorithm the researchers never designed or anticipated, emerging purely from architectural constraints.