Nested Learning: Ali Behrouz on the Quest for Continual Learning & Illusion of AI Architectures
Episode: 180 min
Read time: 3 min
Topics: Artificial Intelligence, Software Development
AI-Generated Summary
Key Takeaways
- Multi-Frequency MLP Architecture: The HOPE architecture replaces the single MLP block in a transformer with multiple MLP blocks updated at different frequencies, for example every 128, 512, and 2,048 tokens. Slower-updating blocks retain knowledge that faster blocks forget, creating a loop in which forgotten skills can re-emerge through backpropagation from the more stable layers. The design targets catastrophic forgetting without separate replay buffers or task-specific fine-tuning strategies.
- Continual Learning Requires Two Phases, Not One: A genuine continual learner eliminates the train/test distinction entirely but still needs two operational modes: an active phase in which inputs arrive and are processed, and a sleep phase in which no external input arrives but internal computation continues. Current LLMs fail at continual learning because they freeze parameters after training and rely on context windows that eventually overflow, which makes knowledge cutoffs structurally inevitable under the existing paradigm.
- Everything in Deep Learning Is Associative Memory: Backpropagation, attention, RNNs, and optimizers all reduce to the same underlying operation: mapping keys to values through an associative memory that compresses a context flow. Behrouz calls current architecture labels an "illusion" because the distinction between optimizer and architecture dissolves under this lens. The gradient context for an optimizer and the token context for an architecture are structurally equivalent, so techniques from one domain can transfer to the other.
- Self-Referential Updates Outperform Standard Attention on Sequential Tasks: In Self-Modifying Titan, the value vector in the associative memory is generated by the module's own current parameters rather than by a fixed projection, which makes the update rule itself a function of the current state. This creates a fully sequential, causal process that standard softmax attention cannot replicate. The tradeoff is reduced parallelizability; the gain is stronger performance on tasks that require sequential reasoning and temporal dependency tracking.
- Multi-Language In-Context Learning as Architecture Benchmark: When models must simultaneously learn two previously unseen languages from in-context grammars (Manchu, and Kalamang from the MTOB benchmark) and then translate them, standard transformers collapse in performance. HOPE architectures with three frequency levels recover near-single-language performance on both languages at once. The result measures memory management quality, meaning the ability to partition and preserve distinct knowledge streams, rather than simple recall or perplexity, which makes it a more diagnostic benchmark for continual learning capability.
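The multi-frequency schedule in the first takeaway can be sketched in a few lines. This is only an illustration of the update timing, assuming each block updates whenever the token position is a multiple of its period; the periods 128, 512, and 2,048 come from the summary, while the function names `blocks_due` and `count_updates` are hypothetical and not taken from the HOPE code.

```python
def blocks_due(step, periods=(128, 512, 2048)):
    """Return the indices of the blocks that update at this token position.

    Assumption: a block updates when the position is a multiple of its period.
    """
    return [i for i, period in enumerate(periods) if step % period == 0]


def count_updates(num_tokens, periods=(128, 512, 2048)):
    """Count how often each block updates while reading num_tokens tokens."""
    counts = [0] * len(periods)
    for step in range(1, num_tokens + 1):
        for i in blocks_due(step, periods):
            counts[i] += 1
    return counts


# Over 2,048 tokens the fastest block updates 16 times, the slowest once.
updates_per_block = count_updates(2048)
```

Under this assumed schedule the slowest block changes sixteen times less often than the fastest one, which is the sense in which it can hold on to knowledge the fast block has already overwritten.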
What It Covers
Cornell researcher and Google scientist Ali Behrouz presents his Nested Learning framework and "Language Models Need Sleep" paper on the Cognitive Revolution podcast. He explains how multi-frequency update architectures (HOPE) enable genuine continual learning, why all deep learning components reduce to associative memory, and how biologically-inspired sleep-phase consolidation could replace the static train/test paradigm in AI systems.
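As a rough illustration of what "mapping keys to values through associative memory" can mean, the sketch below implements a classic linear associative memory: each write adds the outer product of a value and a key to a matrix, and each read multiplies that matrix by a query. This is a textbook construction chosen for illustration, not the formulation used in the episode or the paper, and the class name `LinearMemory` is hypothetical.

```python
def outer(value, key):
    """Outer product value * key^T as a list of rows."""
    return [[v * k for k in key] for v in value]


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LinearMemory:
    """Stores key/value pairs in one matrix, compressing everything written."""

    def __init__(self, value_dim, key_dim):
        self.matrix = [[0.0] * key_dim for _ in range(value_dim)]

    def write(self, key, value):
        # Add the outer product of the value and the key to the memory.
        update = outer(value, key)
        self.matrix = [
            [old + new for old, new in zip(old_row, new_row)]
            for old_row, new_row in zip(self.matrix, update)
        ]

    def read(self, query):
        # Retrieval is exact when the stored keys are orthonormal.
        return matvec(self.matrix, query)


memory = LinearMemory(value_dim=2, key_dim=2)
memory.write(key=[1.0, 0.0], value=[3.0, 4.0])
memory.write(key=[0.0, 1.0], value=[-1.0, 2.0])
recalled = memory.read([1.0, 0.0])  # [3.0, 4.0]
```

With non-orthogonal keys the stored values interfere with each other, which is one way to see why a fixed-size memory has to compress its context rather than store it verbatim.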
Key Questions Answered
- Sleep-Phase Distillation Transfers Knowledge Across Frequency Levels: During the sleep phase, knowledge moves from fast-updating MLP blocks to slower ones via policy distillation: the fast block generates synthetic data from its current state, and the slow block trains on that data. This forces a compression step that produces higher-level abstractions rather than simple parameter copying. New parameters are added to the slower block before distillation to create capacity, and periodic pruning prevents unbounded model growth over a continual learning lifetime.
- The M3 Optimizer Applies Nested Learning to Gradient Compression: M3 extends the Muon optimizer with two momentum buffers updated at different frequencies, mirroring the multi-frequency MLP design of the HOPE architecture, and it outperforms both Adam and Muon on the tested benchmarks. The faster momentum buffer tracks local gradient patterns, while the slower one captures the global structure of the loss landscape. The computational overhead is offset by faster convergence, and the result shows that the nested frequency principle transfers from architecture design into optimization algorithm design.
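The two-buffer idea in the last item can be illustrated with a toy scalar optimizer. This is not the M3 algorithm: it omits Muon's orthogonalization entirely, and the way the buffers are combined (`alpha`), the update period (`slow_period`), and every hyperparameter value are assumptions made for the sketch. It only shows one momentum buffer updated every step next to a second one updated every few steps from the averaged gradient.

```python
def two_timescale_descent(grad_fn, x, steps, lr=0.1, beta_fast=0.9,
                          beta_slow=0.9, slow_period=4, alpha=0.5):
    """Gradient descent on a scalar with a fast and a slow momentum buffer.

    Hypothetical sketch: the fast buffer updates every step; the slow buffer
    updates every slow_period steps from the mean gradient of that window.
    """
    fast = 0.0          # tracks recent gradients
    slow = 0.0          # tracks the longer-run gradient trend
    window_sum = 0.0    # gradients accumulated since the last slow update
    slow_updates = 0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        fast = beta_fast * fast + (1 - beta_fast) * g
        window_sum += g
        if t % slow_period == 0:
            mean_g = window_sum / slow_period
            slow = beta_slow * slow + (1 - beta_slow) * mean_g
            window_sum = 0.0
            slow_updates += 1
        # Assumed combination rule: fast buffer plus a scaled slow buffer.
        x -= lr * (fast + alpha * slow)
    return x, slow_updates


# Toy objective f(x) = x^2, whose gradient is 2x.
x_final, n_slow = two_timescale_descent(lambda x: 2.0 * x, x=5.0, steps=200)
```

The slow buffer changes only once per window, so it behaves like the slow MLP blocks in the architecture: a lower-frequency summary of the same stream the fast buffer sees.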
Notable Moment
Behrouz argues that needle-in-a-haystack recall benchmarks are structurally biased toward transformers and should not be treated as general architecture comparisons. He notes that no human could perform perfect verbatim recall over thousands of tokens, so transformer success on these tasks reflects an architectural quirk rather than general intelligence. The reframing challenges how the field currently ranks competing architectures.