Dataflow Computing for AI Inference with Kunle Olukotun - #751
Episode: 57 min
Read time: 2 min
Topics: Productivity, Remote Work, Startups
AI-Generated Summary
Key Takeaways
- Dataflow vs Instructions: Reconfigurable dataflow architectures configure the hardware to match a PyTorch computation graph instead of fetching instructions every cycle. They synchronize with tokens rather than locks and barriers, and their asynchronous parallel execution achieves 2-3x higher HBM bandwidth utilization than GPUs.
- Decoder Fusion Strategy: Mapping an entire Llama decoder spatially across 16 RDU chips eliminates intermediate data movement across HBM boundaries. The result is a single fused kernel that extends the benefits of flash attention to the whole decoder rather than just the attention mechanism, dramatically reducing memory bandwidth requirements.
- Multi-Model Serving: The SN40L chip includes 1.5 TB of DDR memory alongside 64 GB of HBM, enabling 5 trillion total parameters to stay resident simultaneously with millisecond model-switching latency. This allows high utilization while serving custom fine-tuned models without dedicating separate accelerators to each model.
- Dynamic Architecture Evolution: Research focuses on dynamic reconfigurable dataflow, using streaming tensor programs to handle mixture-of-experts models, variable context lengths, and sparse computations. The goal is runtime graph reconfiguration at sub-microsecond latency, rather than the static, microsecond-scale mapping used in current-generation systems.
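The token-based synchronization mentioned in the first takeaway can be illustrated with a toy software model. This is a minimal sketch, not SambaNova's hardware or compiler: the `Node` class, the `run_dataflow` function, and the three-node graph are hypothetical names invented for illustration. Each node fires as soon as a token has arrived on every one of its inputs, so there is no program counter, lock, or barrier.

```python
from collections import deque


class Node:
    """A compute unit that fires once a token is present on every input."""

    def __init__(self, name, fn, inputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs  # names of upstream producers
        self.tokens = {}      # producer name -> value received so far


def run_dataflow(nodes, sources):
    """Propagate tokens through the graph until none are left in flight."""
    # Build a producer -> consumers map so each token knows where to go.
    consumers = {}
    for node in nodes.values():
        for src in node.inputs:
            consumers.setdefault(src, []).append(node)

    results = dict(sources)
    in_flight = deque(sources.items())
    fired = []
    while in_flight:
        src, value = in_flight.popleft()
        for node in consumers.get(src, []):
            node.tokens[src] = value
            # Token arrival is the only synchronization: fire when all inputs are present.
            if len(node.tokens) == len(node.inputs):
                out = node.fn(*(node.tokens[i] for i in node.inputs))
                results[node.name] = out
                fired.append(node.name)
                in_flight.append((node.name, out))
    return results, fired


# Hypothetical three-node graph: add(scale(x), shift(y)).
graph = {
    "scale": Node("scale", lambda x: 2 * x, ["x"]),
    "shift": Node("shift", lambda y: y + 1, ["y"]),
    "add": Node("add", lambda a, b: a + b, ["scale", "shift"]),
}
results, fired = run_dataflow(graph, {"x": 3, "y": 4})
print(results["add"], fired)
```

In this model `add` cannot run until both `scale` and `shift` have produced a token, yet nothing in the code orders `scale` relative to `shift`. That independence is what lets real dataflow hardware run such units in parallel.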
What It Covers
Kunle Olukotun explains how SambaNova's reconfigurable dataflow architecture achieves 5-10x better performance per watt for AI inference by eliminating instruction fetching, maximizing memory bandwidth utilization, and enabling microsecond model switching across trillion-parameter systems.
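The capacity figures quoted in this summary (1.5 TB of DDR and 5 trillion resident parameters) can be sanity-checked with simple arithmetic. The sketch below is an assumption-laden estimate rather than a statement about SambaNova's configuration: the summary does not state the weight precision or how many DDR pools the 5 trillion figure spans, so bytes per parameter is left as an input, and decimal terabytes are assumed.

```python
TB = 10**12  # decimal terabyte; an assumption about how the capacity is quoted


def params_that_fit(capacity_bytes, bytes_per_param):
    """Number of parameters that fit in one memory pool (weights only)."""
    return capacity_bytes // bytes_per_param


def pools_needed(total_params, capacity_bytes, bytes_per_param):
    """Smallest number of equally sized pools that can hold all parameters."""
    total_bytes = total_params * bytes_per_param
    return -(-total_bytes // capacity_bytes)  # integer ceiling division


ddr_bytes = 1500 * 10**9          # 1.5 TB of DDR, as quoted in the summary
total_params = 5 * 10**12         # 5 trillion parameters, as quoted

for bytes_per_param in (2, 1):    # assumed 16-bit and 8-bit weights
    fit = params_that_fit(ddr_bytes, bytes_per_param)
    pools = pools_needed(total_params, ddr_bytes, bytes_per_param)
    print(f"{bytes_per_param} B/param: {fit:,} params per pool, {pools} pools for 5T")
```

Under the 16-bit assumption, one 1.5 TB pool holds 0.75 trillion parameters, so the 5 trillion figure would span several such pools rather than a single one.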
Notable Moment
Olukotun reveals SambaNova maintains 5x lower latency than GPUs even at high batch sizes because tensor parallelism with overlapped communication remains efficient on dataflow architectures, while GPUs cannot effectively hide communication latency, fundamentally changing the throughput-latency tradeoff curve.
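The role of overlapped communication in that tradeoff can be shown with a deliberately simple timing model. This is a sketch under stated assumptions, not a measurement: the per-item costs are invented, the `step_time` and `sweep` helpers are hypothetical, and the model does not attempt to reproduce the 5x figure. It only shows that exposed communication adds to every step, while fully overlapped communication costs nothing until it exceeds compute time.

```python
def step_time(compute_ms, comm_ms, overlapped):
    """Time for one tensor-parallel step under a two-term cost model."""
    if overlapped:
        return max(compute_ms, comm_ms)  # communication hidden behind compute
    return compute_ms + comm_ms          # communication fully exposed


def sweep(batch_sizes, compute_ms_per_item, comm_ms_per_item):
    """Return (batch, overlapped_ms, exposed_ms) for each batch size."""
    rows = []
    for batch in batch_sizes:
        compute = batch * compute_ms_per_item
        comm = batch * comm_ms_per_item
        rows.append((batch,
                     step_time(compute, comm, overlapped=True),
                     step_time(compute, comm, overlapped=False)))
    return rows


# Invented costs: 0.5 ms of compute and 0.25 ms of communication per batch item.
rows = sweep([1, 8, 32], compute_ms_per_item=0.5, comm_ms_per_item=0.25)
for batch, overlapped_ms, exposed_ms in rows:
    print(f"batch={batch:>2}  overlapped={overlapped_ms} ms  exposed={exposed_ms} ms")
```

With these made-up numbers the exposed case is 1.5x slower at every batch size, so the latency gap persists as the batch grows instead of being amortized away.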