Ali Behrouz

Nested Learning: Ali Behrouz on the Quest for Continual Learning & Illusion of AI Architectures

Jun 3, 2026 · 180 min · Grad student at Cornell, researcher at Google, author of Nested Learning

AI Summary

→ WHAT IT COVERS
Cornell researcher and Google scientist Ali Behrouz presents his Nested Learning framework and "Language Models Need Sleep" paper on the Cognitive Revolution podcast. He explains how multi-frequency update architectures (HOPE) enable genuine continual learning, why all deep learning components reduce to associative memory, and how biologically-inspired sleep-phase consolidation could replace the static train/test paradigm in AI systems.

→ KEY INSIGHTS
- **Multi-Frequency MLP Architecture:** The HOPE architecture replaces the single MLP block in a transformer with multiple MLP blocks updated at different frequencies, for example every 128, 512, and 2,048 tokens. Slower-updating blocks retain knowledge that faster blocks forget, creating a loop in which forgotten skills can re-emerge through backpropagation from the stable layers. This addresses catastrophic forgetting directly, without separate replay buffers or task-specific fine-tuning strategies.
- **Continual Learning Requires Two Phases, Not One:** A genuine continual learner eliminates the train/test distinction entirely, but still requires two operational modes: an active phase, in which inputs arrive and are processed, and a sleep phase, in which no external input arrives but internal computation continues. Current LLMs fail at continual learning because they freeze parameters after training and rely on context windows that eventually overflow, which makes knowledge cutoffs structurally inevitable under the existing paradigm.
- **Everything in Deep Learning Is Associative Memory:** Backpropagation, attention, RNNs, and optimizers all reduce to the same underlying operation: mapping keys to values through an associative memory that compresses a context flow. Behrouz calls current architecture labels an "illusion" because the distinction between optimizer and architecture dissolves under this lens. The gradient context of an optimizer and the token context of an architecture are structurally equivalent, so techniques from one domain transfer directly to the other.
- **Self-Referential Updates Outperform Standard Attention on Sequential Tasks:** In Self-Modifying Titan, the value vector in the associative memory is generated by the module's own current parameters rather than by a fixed projection, which makes the update rule itself a function of the current state. This creates a fully sequential, causal process that standard softmax attention cannot replicate. The tradeoff is reduced parallelizability; the gain is stronger performance on tasks that require sequential reasoning and temporal dependency tracking.
- **Multi-Language In-Context Learning as Architecture Benchmark:** When models must simultaneously learn two previously unseen languages from in-context grammars (Manchu and Kalamang, the language of the MTOB benchmark) and then translate them, the performance of standard transformers collapses. HOPE architectures with three frequency levels recover near-single-language performance on both languages at once. The result measures memory management quality, meaning the ability to partition and preserve distinct knowledge streams, rather than simple recall or perplexity, which makes it a more diagnostic benchmark for continual learning capability.
- **Sleep-Phase Distillation Transfers Knowledge Across Frequency Levels:** During the sleep phase, knowledge moves from fast-updating MLP blocks to slower ones via policy distillation: the fast block generates synthetic data from its current state, and the slow block trains on that data. This forces a compression step that produces higher-level abstractions rather than a simple copy of parameters. New parameters are added to the slower block before distillation to create capacity, and periodic pruning prevents unbounded model growth over a continual learning lifetime.
- **The M3 Optimizer Applies Nested Learning to Gradient Compression:** The M3 optimizer extends the Muon optimizer with two momentum buffers updated at different frequencies, mirroring the multi-frequency MLP design of HOPE, and outperforms both Adam and Muon on the tested benchmarks. The faster momentum buffer tracks local gradient patterns, while the slower one captures the global structure of the loss landscape. The computational overhead is offset by faster convergence, and the result shows that the nested frequency principle transfers from architecture design directly into optimizer design.

→ NOTABLE MOMENT
Behrouz argues that needle-in-a-haystack recall benchmarks are structurally biased toward transformers and should not be treated as general architecture comparisons. He notes that no human could perform perfect verbatim recall from thousands of tokens, so transformer success on these tasks reflects an architectural quirk rather than general intelligence, a reframing that challenges how the field currently ranks competing architectures.

💼 SPONSORS
Mercury (https://mercury.com), Anthropic (Claude) (https://claude.ai/tcr)

🏷️ Continual Learning, AI Architecture, Transformer Alternatives, Associative Memory, Neural Network Optimization, Biologically-Inspired AI, Large Language Models
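The multi-frequency update schedule described under KEY INSIGHTS (blocks updated every 128, 512, and 2,048 tokens) can be illustrated with a small sketch. Only the three periods come from the summary; the function names, the 1-based token counter, and the rule "a block updates when the token count is a multiple of its period" are assumptions made for illustration, not the HOPE implementation.

```python
# Illustrative sketch only: the periods are quoted from the summary,
# while the scheduling rule and all names below are hypothetical.
UPDATE_PERIODS = (128, 512, 2048)  # tokens between updates, fastest to slowest


def blocks_due(token_count: int, periods=UPDATE_PERIODS) -> list[int]:
    """Return the indices of the blocks that update at this 1-based token count."""
    if token_count <= 0:
        raise ValueError("token_count is a 1-based count and must be positive")
    return [i for i, period in enumerate(periods) if token_count % period == 0]


def update_counts(num_tokens: int, periods=UPDATE_PERIODS) -> list[int]:
    """Count how many updates each block receives over a stream of num_tokens tokens."""
    counts = [0] * len(periods)
    for t in range(1, num_tokens + 1):
        for i in blocks_due(t, periods):
            counts[i] += 1
    return counts


if __name__ == "__main__":
    print(blocks_due(512))      # fast and middle blocks update together
    print(update_counts(2048))  # updates per block over 2,048 tokens
```

Under these assumptions, the slowest block changes once for every 16 updates of the fastest block over a 2,048-token stream, which is one way to picture why it can hold on to knowledge that the fast block has already overwritten.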

Featured On 1 Podcast

Cognitive Revolution
