Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750
Episode · 57 min · Read time: 2 min
AI-Generated Summary
Key Takeaways
- ✓State Size Balance: At long context, a transformer's state is roughly 100,000x larger than an LSTM's, while RNN states are too small to capture long contexts. Optimal architectures keep weight FLOPs and state FLOPs within one order of magnitude of each other for compute-efficient training and inference.
- ✓Chunked Algorithm: Power retention has two equivalent computational forms: a recurrent form for sequential processing and an attention form for parallel processing. Breaking sequences into GPU-optimized chunks gives linear cost scaling while keeping the hardware fully saturated, capturing the best of both approaches without a mathematical tradeoff (see the sketch after this list).
- ✓Model Metamorphosis: Converting an existing transformer to power retention takes only about two hours of retraining on 128 H100s. StarCoder 3B recovered its full 30% HumanEval score after this brief metamorphosis period, making adoption practical without pretraining from scratch.
- ✓Vidrial CUDA Framework: A custom CUDA framework delivers roughly 20% speedups over Flash Attention on non-standard problem shapes by separating static from dynamic computation. JIT compilation sweeps candidate configurations to find the best tile sizes and memory-access patterns for a given GPU and sequence length (see the autotuning sketch after this list).
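To make the chunked idea concrete, here is a minimal NumPy sketch. It is not the released power retention kernel: the carried state below is a simple linear-attention-style summary, causal masking and normalization are omitted, and all names are illustrative. It only shows the general pattern of exact attention within a chunk plus a recurrent state carried across chunk boundaries, which is what yields linear total cost from dense, hardware-friendly matmuls.

```python
import numpy as np

def chunked_recurrence_attention(Q, K, V, chunk=128):
    """Illustrative only: exact attention inside each chunk, plus a carried
    state (a running sum of K^T V) summarizing all earlier chunks."""
    T, d = Q.shape
    state = np.zeros((d, d))      # recurrent state carried across chunks
    outputs = []
    for start in range(0, T, chunk):
        q, k, v = (X[start:start + chunk] for X in (Q, K, V))
        local = (q @ k.T) @ v     # attention (parallel) form within the chunk
        carried = q @ state       # recurrent contribution from earlier chunks
        outputs.append(local + carried)
        state += k.T @ v          # fold this chunk into the state
    return np.vstack(outputs)

# Per-chunk work is O(chunk^2 * d + chunk * d^2), independent of position,
# so total cost grows linearly with sequence length T.
T, d = 1024, 64
Q, K, V = (np.random.randn(T, d) for _ in range(3))
print(chunked_recurrence_attention(Q, K, V).shape)  # (1024, 64)
```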
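The Vidrial takeaway can be sketched at a similarly high level. The loop below is not Vidrial's API (build_kernel and run are hypothetical placeholders); it only illustrates the generic autotuning pattern such frameworks use: enumerate candidate static configurations, JIT-compile a kernel specialized to each, time it on the real problem shape, and keep the fastest.

```python
import itertools
import time

def autotune(build_kernel, run, tile_ms=(32, 64, 128), tile_ns=(32, 64, 128)):
    """Sweep candidate (tile_m, tile_n) configurations, timing each specialized
    kernel on the actual problem shape and returning the fastest config.
    build_kernel(cfg) stands in for JIT compilation with static parameters;
    run(kernel) executes it on representative inputs. Both are placeholders."""
    best_cfg, best_time = None, float("inf")
    for tile_m, tile_n in itertools.product(tile_ms, tile_ns):
        kernel = build_kernel({"tile_m": tile_m, "tile_n": tile_n})
        run(kernel)                          # warm-up
        start = time.perf_counter()
        run(kernel)                          # timed run (simplified)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cfg, best_time = (tile_m, tile_n), elapsed
    return best_cfg  # worth caching per (GPU, sequence length, head dim)
```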
What It Covers
Jacob Buckman explains the power retention architecture, which combines recurrence and attention to achieve linear scaling for long-context processing while maintaining computational efficiency through balanced weight-to-state FLOP ratios and a chunked algorithm; a rough back-of-the-envelope version of the FLOP balance is sketched below.
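The constants in this sketch are standard approximations (about 2 matmul FLOPs per parameter per token for the weights; about 2 × context × d_model FLOPs per layer per token for each of the QK and attention-times-V products), and the 7B-parameter, 32-layer, 4096-wide model is an assumed example rather than a figure from the episode.

```python
def weight_flops_per_token(n_params):
    # Forward-pass matmul FLOPs driven by weights: ~2 per parameter per token.
    return 2 * n_params

def state_flops_per_token(n_layers, d_model, context_len):
    # Attention FLOPs driven by the cached state: QK^T and the attention-
    # weighted sum over V each cost ~2 * context * d_model per layer per token.
    return n_layers * 2 * (2 * context_len * d_model)

# Assumed example: a ~7B-parameter transformer, 32 layers, d_model = 4096.
for ctx in (4_096, 1_000_000):
    ratio = state_flops_per_token(32, 4096, ctx) / weight_flops_per_token(7e9)
    print(f"context = {ctx:>9,} tokens -> state/weight FLOP ratio ~ {ratio:.2f}")
# Short context: ratio ~0.15 (weights dominate). At ~1M tokens: ratio ~37
# (state dominates) -- the kind of imbalance the episode argues against.
```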
Key Questions Answered
- •How do transformer and RNN state sizes compare at long context, and what ratio of weight FLOPs to state FLOPs is compute-optimal?
- •How does power retention's chunked algorithm combine a recurrent form and an attention form to get linear cost while keeping GPUs saturated?
- •How much retraining does it take to convert an existing transformer such as StarCoder 3B to power retention, and does performance recover?
- •How does the Vidrial CUDA framework achieve speedups over Flash Attention on non-standard problem shapes?
Notable Moment
Buckman reveals that typical window-attention models plateau in their ability to use context far earlier than their advertised effective context length (depth times window size), showing that they fail to leverage most of the available tokens.
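For concreteness, the advertised figure is simple arithmetic; the depth and window size below are assumed, typical values rather than numbers from the episode.

```python
# Advertised "effective context" of a sliding-window attention model:
# information can propagate one window per layer, so reach = depth * window.
depth, window = 32, 4_096
print(depth * window)  # 131072 tokens of nominal reach
# Buckman's point: in practice such models plateau far before this bound,
# so most of those tokens are never effectively used.
```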