Lowering the Cost of Intelligence With NVIDIA's Ian Buck - Ep. 284
Episode · 38 min · Read time: 2 min
AI-Generated Summary
Key Takeaways
- ✓ MoE Cost Reduction: OpenAI's GPT-OSS model uses 120 billion total parameters but activates only about 5 billion per query, versus Llama's 405 billion fully active parameters, reducing benchmark costs from $200 to $75 while roughly doubling intelligence scores through selective expert activation.
- ✓ NVLink Communication Architecture: GB200 NVL72 connects 72 GPUs with non-blocking, terabytes-per-second bandwidth over copper links signaling at 200 gigabits per second, delivering a 15x performance improvement over 8-GPU Hopper systems while adding only 50% cost, a 10x reduction in token cost to 10 cents per million tokens.
- ✓ Expert Parallelization Strategy: Modern MoE models deploy 300-400 experts across multiple layers, with router networks directing each query to the 2-8 most relevant experts and combining their responses. The knowledge domains are not prescribed: training naturally clusters information into specialized pockets through data-exposure patterns rather than manual categorization.
- ✓ Extreme Co-Design Process: NVIDIA employs more software engineers than hardware engineers, optimizing end-to-end performance through kernel fusions and overlapping NVLink communication with compute; the team recently achieved a 2x performance gain on a customer model within two weeks, halving token costs through software optimization alone, with no hardware changes.
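The routing described in the takeaways above, a gating network selecting the top few experts per token and mixing their outputs, can be sketched in a few lines. This is an illustrative toy (random weights, plain matrices standing in for full expert FFNs), not the implementation of any model named in the episode; `moe_layer`, `top_k`, and the dimensions are all assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, router_w, top_k=2):
    """Route a token through its top-k experts and mix their outputs.

    x: (d,) token activation; experts: list of (d, d) weight matrices
    standing in for full expert FFNs; router_w: (d, n_experts) gating
    weights. Illustrative sketch only.
    """
    logits = x @ router_w                      # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                       # softmax over selected experts
    # Only top_k expert matmuls run; the other experts stay idle,
    # which is where the per-query cost saving comes from.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

d, n_experts = 16, 8
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))
y = moe_layer(rng.standard_normal(d), experts, router_w, top_k=2)
print(y.shape)  # (16,)
```

With 8 experts and top_k=2, only a quarter of the expert weights participate in any one token, which is the same selective-activation principle the takeaways describe at 120-billion-parameter scale.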
What It Covers
Ian Buck explains how the Mixture-of-Experts (MoE) architecture powers leading AI models by activating only 3-10% of a network's parameters per query, reducing token costs by 10x while increasing intelligence scores from 28 to 61.
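The 3-10% figure follows directly from the parameter counts quoted in the takeaways. A quick arithmetic check, using the summary's numbers as assumed inputs:

```python
# Arithmetic sketch using the figures quoted in this summary (assumed values).
total_params = 120e9       # GPT-OSS total parameters, per the summary
active_params = 5e9        # parameters activated per query, per the summary
dense_params = 405e9       # Llama's fully active parameter count, per the summary

active_fraction = active_params / total_params        # ~4.2%, inside the 3-10% range
compute_vs_dense = active_params / dense_params       # ~1.2% of the dense model's per-token matmuls
print(f"{active_fraction:.1%} of weights active per token")
print(f"~{compute_vs_dense:.1%} of the dense model's per-token compute")
```

The point of the check: activating ~5B of 120B parameters lands squarely in the 3-10% window Buck describes, and against a 405B dense model the per-token compute gap is even larger than the cost numbers alone suggest.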
Notable Moment
Buck reveals that NVLink's copper links use PAM4 signaling, which encodes two bits per symbol across four voltage levels instead of binary zero-one, pushing physics limits at millimeter wavelengths to serve trillion-parameter MoE models like QwenMax-2 that activate only 32 billion parameters per query.
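PAM4's bandwidth doubling is easy to see in code: four voltage levels mean each transmitted symbol carries a two-bit pair, so the same symbol rate moves twice the bits of binary NRZ. A minimal sketch (the level values and Gray-coded mapping are illustrative assumptions, not NVLink's actual electrical spec):

```python
# PAM4 sketch: four voltage levels, two bits per symbol (vs one for NRZ).
# Gray-coded mapping so adjacent levels differ by a single bit (assumed values).
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}

def encode_pam4(bits):
    """Pack a bit sequence into PAM4 symbols, two bits per symbol."""
    assert len(bits) % 2 == 0, "PAM4 consumes bits in pairs"
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

symbols = encode_pam4([1, 0, 0, 1, 1, 1])
print(symbols)       # 6 bits -> 3 symbols: [3, -1, 1]
print(len(symbols))  # half as many symbols as bits
```

Halving the symbol count per bit is what lets a 200-gigabit-per-second copper lane run at a symbol rate the physics of the wire can still support.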