Dataflow Computing for AI Inference with Kunle Olukotun - #751

October 14, 2025

57 min episode · 2 min read

Kunle Olukotun

Topics

Artificial Intelligence, Science & Discovery

AI-Generated Summary

Published Dec 31, 2025

Key Takeaways

✓ Dataflow vs Instructions: Reconfigurable dataflow architectures configure the hardware to match the PyTorch computation graph instead of fetching instructions every cycle. Token-based synchronization replaces locks and barriers, and the resulting asynchronous parallel execution achieves 2-3x higher HBM bandwidth utilization than GPUs.
✓ Decoder Fusion Strategy: Mapping an entire Llama decoder spatially across 16 RDU chips eliminates intermediate data movement to and from HBM. The result is a single fused kernel that extends FlashAttention-style benefits to the whole decoder rather than the attention mechanism alone, substantially reducing memory bandwidth requirements.
✓ Multi-Model Serving: The SN40L chip pairs 64GB of HBM with 1.5TB of DDR memory, enabling 5 trillion total parameters to stay resident simultaneously with millisecond model-switching latency. This keeps utilization high while serving custom fine-tuned models, without dedicating a separate accelerator to each model.
✓ Dynamic Architecture Evolution: Research focuses on dynamic reconfigurable dataflow, using streaming tensor programs to handle mixture-of-experts models, variable context lengths, and sparse computations. The goal is runtime graph reconfiguration at sub-microsecond latency, in place of the static, microsecond-scale mapping used in current-generation systems.
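
The token-based synchronization described in the first takeaway can be illustrated with a toy simulation. The sketch below is a minimal Python model, not SambaNova's hardware or software: the `Node` class, the `connect`, `run`, and `demo` helpers, and the three-stage scale/shift/relu graph are all hypothetical names chosen for illustration. Each node fires as soon as every one of its input queues holds a token, so no lock or global barrier is needed to order the work.

```python
from collections import deque


class Node:
    """One compute unit in a toy dataflow graph (hypothetical, for illustration)."""

    def __init__(self, name, fn, num_inputs):
        self.name = name
        self.fn = fn
        # One FIFO queue of tokens per input port.
        self.inputs = [deque() for _ in range(num_inputs)]
        # Downstream (node, port) pairs that receive this node's results.
        self.consumers = []

    def ready(self):
        # A node may fire only when every input port holds at least one token.
        return all(len(q) > 0 for q in self.inputs)

    def fire(self):
        # Consume one token from each port and compute the result.
        args = [q.popleft() for q in self.inputs]
        return self.fn(*args)


def connect(src, dst, port):
    """Route the output of src to the given input port of dst."""
    src.consumers.append((dst, port))


def run(nodes, feeds):
    """Inject (node, port, value) tokens, then fire ready nodes until none remain."""
    for node, port, value in feeds:
        node.inputs[port].append(value)
    outputs = []
    fired = True
    while fired:
        fired = False
        for node in nodes:
            if node.ready():
                result = node.fire()
                fired = True
                if node.consumers:
                    for dst, port in node.consumers:
                        dst.inputs[port].append(result)
                else:
                    # Nodes without consumers are graph outputs.
                    outputs.append(result)
    return outputs


def demo():
    """Stream three inputs through relu(x * 2 + 1)."""
    scale = Node("scale", lambda x, w: x * w, 2)
    shift = Node("shift", lambda y, b: y + b, 2)
    relu = Node("relu", lambda z: max(z, 0), 1)
    connect(scale, shift, 0)
    connect(shift, relu, 0)
    feeds = []
    for x in [1, -2, 3]:
        # Each firing consumes one weight token and one bias token.
        feeds += [(scale, 0, x), (scale, 1, 2), (shift, 1, 1)]
    return run([scale, shift, relu], feeds)


if __name__ == "__main__":
    print(demo())
```

The scheduler loop here is sequential only because it is a simulation. In real dataflow hardware the nodes would be physical units operating concurrently, with token arrival playing the role that `ready()` plays in this sketch.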

What It Covers

Kunle Olukotun explains how SambaNova's reconfigurable dataflow architecture achieves 5-10x better performance per watt for AI inference by eliminating instruction fetching, maximizing memory bandwidth utilization, and enabling microsecond model switching across trillion-parameter systems.

Notable Moment

Olukotun says SambaNova maintains 5x lower latency than GPUs even at high batch sizes. Tensor parallelism with overlapped communication remains efficient on dataflow architectures, whereas GPUs cannot effectively hide communication latency, and this fundamentally changes the throughput-latency tradeoff curve.
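
A simple per-layer latency model shows why hiding communication matters for tensor parallelism. The sketch below is an assumption-laden illustration rather than anything stated in the episode: the function names, the `overlap` parameter, and all timing values are hypothetical. It treats each layer as a compute phase plus a communication phase, with `overlap` giving the fraction of the shorter phase that runs concurrently with the longer one.

```python
def layer_latency_us(compute_us, comm_us, overlap):
    """Latency of one tensor-parallel layer, in microseconds.

    overlap is the fraction (0.0 to 1.0) of the shorter phase that is
    hidden behind the longer one. 0.0 means fully serialized,
    1.0 means fully overlapped.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0.0 and 1.0")
    hidden = overlap * min(compute_us, comm_us)
    return compute_us + comm_us - hidden


def model_latency_us(num_layers, compute_us, comm_us, overlap):
    """Total latency for num_layers identical layers executed in sequence."""
    return num_layers * layer_latency_us(compute_us, comm_us, overlap)


if __name__ == "__main__":
    # Hypothetical numbers: 32 layers, 10 us compute, 4 us communication.
    print(model_latency_us(32, 10.0, 4.0, overlap=0.0))  # serialized
    print(model_latency_us(32, 10.0, 4.0, overlap=1.0))  # fully overlapped
```

With full overlap the per-layer cost collapses to the longer of the two phases, so communication stops adding to latency whenever it is shorter than compute. With no overlap the two phases add, and the penalty grows with every layer.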
