
Jacob Buckman

Jacob Buckman is a machine learning researcher working on transformer alternatives and long-context AI. Specializing in computational efficiency, Buckman developed the Power Retention architecture, which addresses critical scaling limitations in current transformer models by enabling linear-cost context processing and more flexible state management. His work challenges fundamental assumptions about neural network design, proposing methods that could significantly expand the capabilities of AI systems by balancing weight and state computation. Through his research, Buckman explores how alternative architectures might overcome transformer limitations, particularly in handling extended context. His insights have been featured on leading AI and machine learning podcasts, positioning him as an emerging voice in next-generation neural network design.

2 episodes
2 podcasts

We have 2 summarized appearances for Jacob Buckman so far. Browse all podcasts to discover more episodes.

Featured On 2 Podcasts

All Appearances

2 episodes

AI Summary

→ WHAT IT COVERS
Jacob Buckman explains the Power Retention architecture for transformers, which combines recurrence and attention to achieve linear-cost scaling for long-context processing while maintaining computational efficiency through balanced weight-state FLOP ratios and a chunked algorithm.

→ KEY INSIGHTS
- **State Size Balance:** At long context, transformers carry states roughly 100,000x larger than LSTMs, while RNN states are too small. Optimal architectures keep weight FLOPs and state FLOPs within one order of magnitude of each other for compute-efficient training and inference (a back-of-envelope version is sketched after this summary).
- **Chunked Algorithm:** Power Retention has dual computation forms: recurrent for sequential processing and attention-like for parallel processing. Breaking sequences into GPU-optimized chunks yields linear cost scaling while keeping the hardware fully saturated, capturing the best of both approaches without mathematical tradeoffs (see the second sketch below).
- **Model Metamorphosis:** Converting an existing transformer to Power Retention requires only about two hours of retraining on 128 H100s. StarCoder 3B recovered its full 30% HumanEval score after this brief metamorphosis period, making adoption practical without pretraining from scratch.
- **Vidrial CUDA Framework:** A custom CUDA framework achieves 20% speedups over Flash Attention on non-standard problem shapes by separating static from dynamic computation. JIT compilation sweeps configurations to find optimal tile sizes and memory access patterns for a given piece of hardware and sequence length.

→ NOTABLE MOMENT
Buckman points out that windowed-attention models plateau in their ability to use context far earlier than their advertised effective context length (calculated as depth times window size), demonstrating that they fail to leverage most of the available tokens.

💼 SPONSORS: Capital One
🏷️ Transformer Architecture, Long Context Models, CUDA Optimization, State Space Models
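A back-of-envelope way to see the weight/state balance point is to compare per-token FLOPs spent applying the weights against per-token FLOPs spent reading the model's state (for a transformer, the KV cache). The sketch below uses assumed model dimensions purely for illustration; none of these numbers come from the episode.

```python
# Rough illustration of the weight-FLOPs vs. state-FLOPs balance discussed
# above. All model dimensions here are assumptions for the example.
def per_token_flops(n_params: float, state_elements: float):
    weight_flops = 2 * n_params        # ~2 FLOPs per weight per token
    state_flops = 2 * state_elements   # ~2 FLOPs per state element per token
    return weight_flops, state_flops

# Hypothetical 7B-parameter transformer: its "state" is the KV cache,
# which grows linearly with context length.
n_layers, n_heads, head_dim = 32, 32, 128
for ctx in (4_000, 128_000, 1_000_000):
    kv_cache = 2 * n_layers * n_heads * head_dim * ctx   # keys + values
    w, s = per_token_flops(7e9, kv_cache)
    print(f"ctx={ctx:>9,}  state/weight FLOP ratio: {s / w:.2f}")

# The ratio crosses 1 and keeps growing with context; a balanced design
# would hold it within roughly one order of magnitude.
```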
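To make the chunked dual-form idea concrete, here is a minimal sketch of chunked linear attention: each chunk is processed with a parallel, attention-style computation, while a fixed-size state carries information across chunks, so total cost grows linearly in sequence length. This illustrates the general technique only; Manifest AI's actual Power Retention kernels differ in the retention function and in their fused CUDA implementation.

```python
# Minimal sketch of the chunked dual-form idea: attention within a chunk,
# recurrence across chunks. Illustrative, not Manifest AI's implementation.
import torch

def chunked_linear_attention(q, k, v, chunk_size=128):
    """q, k, v: (seq_len, dim) tensors; returns (seq_len, dim) outputs."""
    seq_len, dim = q.shape
    state = torch.zeros(dim, dim, dtype=q.dtype)  # running sum of k^T v
    outputs = []
    for start in range(0, seq_len, chunk_size):
        qc = q[start:start + chunk_size]
        kc = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        # Intra-chunk term: parallel, attention-style causal computation.
        scores = qc @ kc.T
        intra = (scores * torch.tril(torch.ones_like(scores))) @ vc
        # Inter-chunk term: all earlier chunks are summarized in a
        # fixed-size state, so each chunk costs the same regardless of
        # its position in the sequence.
        inter = qc @ state
        outputs.append(intra + inter)
        state = state + kc.T @ vc  # fold this chunk into the running state
    return torch.cat(outputs, dim=0)

q, k, v = (torch.randn(4096, 64) for _ in range(3))
print(chunked_linear_attention(q, k, v).shape)  # torch.Size([4096, 64])
```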

Eye on AI

#299 Jacob Buckman: Why the Future of AI Won't Be Built on Transformers

Eye on AI
57 min · Founder of Manifest AI, AI Researcher

AI Summary

→ WHAT IT COVERS
Jacob Buckman explains Power Retention, a new AI architecture that addresses transformer scaling limitations with linear-cost context windows, enabling models to process unbounded context without quadratic compute costs or performance degradation.

→ KEY INSIGHTS
- **Power Retention Architecture:** Combines recurrent neural networks with attention mechanisms via state space models, decoupling state size from parameter count. This yields linear rather than quadratic cost growth as context windows expand.
- **Metamorphosis Retraining Process:** Existing transformer models like LLAMA can be converted to Power Retention in about six hours on dozens of GPUs by swapping attention calls for power retention calls, preserving the original performance while gaining linear-cost inference and unbounded context (sketched below).
- **Context vs. Weight Updates:** Future AI systems should inject new knowledge through updates to the context state rather than weight fine-tuning. This avoids catastrophic forgetting, since context-based learning mirrors human experience accumulation rather than evolutionary weight changes via gradient descent.
- **Butler vs. Consultant Dynamic:** Because growing state is expensive, current transformers force chat resets, creating consultant-like interactions. Power Retention enables a persistent state across all user interactions, creating butler-like AI that accumulates the user's complete history and preferences for better responses.

→ NOTABLE MOMENT
Buckman reveals that advertised long-context models use sparse or windowed attention rather than true full attention, processing only small subsets of the context. This industry-wide practice creates performance degradation that users mistake for inherent limitations rather than architectural compromises.

💼 SPONSORS: Agency (https://agntcy.org)
🏷️ AI Architecture, State Space Models, Context Windows, Transformer Alternatives
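As a rough illustration of the metamorphosis step, the sketch below walks a pretrained PyTorch model and swaps each attention submodule for a fixed-state recurrent layer; a short retraining run would then follow. The `RetentionLayer` here is a generic linear-recurrence stand-in with hypothetical names, not Manifest AI's actual layer or API.

```python
# Rough sketch of the "metamorphosis" step: walk a pretrained model and
# swap each attention submodule for a fixed-state recurrent layer, then
# briefly retrain. Names are hypothetical, not Manifest AI's actual API.
import torch
import torch.nn as nn

class RetentionLayer(nn.Module):
    """Generic linear-recurrence stand-in with a fixed-size state."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Unoptimized parallel form: cumulative k^T v state gives causal,
        # linear-cost mixing (real kernels never materialize this tensor).
        kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)
        return torch.einsum('bsd,bsde->bse', q, kv)

def metamorphose(model, attn_class, dim):
    """Recursively replace every attn_class submodule with RetentionLayer."""
    for name, child in model.named_children():
        if isinstance(child, attn_class):
            setattr(model, name, RetentionLayer(dim))
        else:
            metamorphose(child, attn_class, dim)
    return model

# Toy example: swap out nn.MultiheadAttention, then run the converted model.
model = nn.Sequential(nn.MultiheadAttention(64, 8), nn.Linear(64, 64))
metamorphose(model, nn.MultiheadAttention, dim=64)
print(model(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```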
