The TWIML AI Podcast

Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750

57 min episode · 2 min read

AI-Generated Summary

Key Takeaways

  • State Size Balance: At long contexts, a transformer's state (its KV cache) is roughly 100,000x larger than an LSTM's, while classic RNN states are too small to be useful. Optimal architectures keep weight FLOPs and state FLOPs within one order of magnitude of each other for compute-efficient training and inference (a rough sketch follows this list).
  • Chunked Algorithm: Power retention has two equivalent computation forms — a recurrent form for sequential processing and an attention form for parallel processing. Breaking sequences into GPU-sized chunks gives linear cost scaling while keeping the hardware fully saturated, capturing the best of both forms without mathematical tradeoffs.
  • Model Metamorphosis: Converting an existing transformer to power retention requires only about two hours of retraining on 128 H100s. StarCoder 3B recovered its full 30% HumanEval score after this brief metamorphosis period, making adoption practical without pretraining from scratch.
  • Vidrial CUDA Framework: A custom CUDA framework delivers 20% speedups over Flash Attention on non-standard problem shapes by separating static and dynamic computation. JIT compilation sweeps configurations to find the optimal tile sizes and memory patterns for a specific piece of hardware and sequence length.
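
A rough back-of-the-envelope sketch of the weight-FLOPs vs. state-FLOPs comparison from the first takeaway; the parameter count, layer count, hidden size, and context length below are illustrative assumptions, not figures quoted in the episode.

```python
# Compare per-token "weight FLOPs" (proportional to parameter count) with
# per-token "state FLOPs" (proportional to the state read at each step).
# All sizes here are assumed for illustration only.

def per_token_flops(n_params, state_elems):
    weight_flops = 2 * n_params      # one multiply-add per parameter
    state_flops = 2 * state_elems    # one multiply-add per state element
    return weight_flops, state_flops

# Hypothetical 3B-parameter transformer at a 1M-token context: its state is the
# KV cache, roughly 2 (K and V) * layers * context * d_model elements.
layers, d_model, context = 32, 2560, 1_000_000
kv_cache = 2 * layers * context * d_model

# Hypothetical LSTM of similar size: its state is a single hidden vector.
lstm_state = d_model

for name, state in [("transformer @ 1M ctx", kv_cache), ("LSTM", lstm_state)]:
    w, s = per_token_flops(3e9, state)
    print(f"{name}: weight FLOPs / state FLOPs ≈ {w / s:.1e}")
```

Under these assumptions the ratio lands many orders of magnitude away from 1, in opposite directions for the two architectures, whereas the takeaway above calls for keeping it within roughly one order of magnitude.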

What It Covers

Jacob Buckman explains the power retention architecture, which combines recurrence and attention to achieve linear scaling for long-context processing while maintaining computational efficiency through a balanced weight-to-state FLOP ratio and a chunked algorithm.
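
A minimal sketch of the chunked dual-form idea, written here for a plain linear-attention-style layer rather than the actual power retention math (and omitting normalization); the chunk size and tensor shapes are arbitrary assumptions.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, chunk=128):
    """Within each chunk: the parallel (attention) form.
    Across chunks: a recurrent state, so total cost is linear in length."""
    T, d = Q.shape
    S = np.zeros((d, d))                  # recurrent state: running sum of k^T v
    out = np.empty_like(V)
    for start in range(0, T, chunk):
        q = Q[start:start + chunk]
        k = K[start:start + chunk]
        v = V[start:start + chunk]
        inter = q @ S                     # contribution of all previous chunks
        mask = np.tril(np.ones((len(q), len(q))))
        intra = (mask * (q @ k.T)) @ v    # causal attention within the chunk
        out[start:start + len(q)] = inter + intra
        S += k.T @ v                      # fold this chunk into the state
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((2048, 64)) for _ in range(3))
print(chunked_linear_attention(Q, K, V).shape)  # (2048, 64)
```

The per-chunk matmuls keep the GPU saturated like ordinary attention, while the state hand-off between chunks keeps the overall cost linear in sequence length — the tradeoff-free combination the episode describes.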

Notable Moment

Buckman points out that typical window-attention models plateau in their ability to use context far earlier than their advertised effective context length (calculated as depth times window size), demonstrating that they fail to leverage most of the available tokens.
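
For concreteness, a worked example of that advertised figure; the layer count and window size here are hypothetical.

```python
# Nominal effective context = depth * window size (the "advertised" number).
depth, window = 32, 4096   # hypothetical layer count and attention window
print(depth * window)      # 131072 tokens — far more than such models actually exploit
```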
