Lowering the Cost of Intelligence With NVIDIA's Ian Buck - Ep. 284
Episode: 38 min
Read time: 2 min
Topics: Fundraising & VC, Leadership, Design & UX
AI-Generated Summary
Key Takeaways
- ✓ MoE Cost Reduction: OpenAI's GPT-OSS model has 120 billion total parameters but activates only 5 billion per query, whereas Llama's 405 billion parameters are all active on every query. Selective expert activation cuts the benchmark cost from $200 to $75 while doubling the intelligence score.
- ✓ NVLink Communication Architecture: GB200 NVL72 connects 72 GPUs with non-blocking bandwidth of terabytes per second over copper wires running at 200 gigabits per second. It delivers a 15x performance improvement over 8-GPU Hopper systems at only 50% higher cost, a 10x reduction in token cost to 10 cents per million tokens.
- ✓ Expert Parallelization Strategy: Modern MoE models deploy 300-400 experts across multiple layers, with router networks directing each query to 2-8 relevant experts at once and combining their responses. The experts are not assigned prescriptive knowledge domains. Training naturally clusters information into specialized pockets through patterns of data exposure rather than manual categorization.
- ✓ Extreme Co-Design Process: NVIDIA's software engineers outnumber its hardware engineers and optimize end-to-end performance through kernel fusions and by overlapping NVLink communication with compute. They recently achieved a 2x performance gain on customer models within two weeks, halving token costs through software optimization alone, with no hardware changes.
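The routing step described in the takeaways (a router picks a few experts per query and combines their responses) can be sketched roughly as follows. This is a toy illustration, not NVIDIA's or any model's actual implementation: the names `route_top_k` and `moe_layer`, the scalar "experts", and the random router scores are all hypothetical, and real models route per token inside each layer using learned weights.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_scores, k):
    """Run only the selected experts and blend their outputs by router weight."""
    selected = route_top_k(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in selected)
    return output, [i for i, _ in selected]


# Toy setup (assumption): 8 "experts" that each scale the input differently.
NUM_EXPERTS = 8
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

# Stand-in for a learned router: random scores with a fixed seed.
rng = random.Random(0)
router_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, used = moe_layer(2.0, experts, router_scores, k=2)
print(f"experts used: {used} of {NUM_EXPERTS}, output: {output:.3f}")
```

Only the `k` selected experts are evaluated, which is why the active parameter count per query can be a small fraction of the total.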
What It Covers
Ian Buck explains how the Mixture of Experts (MoE) architecture powers leading AI models by activating only 3-10% of a neural network's parameters per query, reducing token costs by 10x while increasing intelligence scores from 28 to 61.
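The figures quoted in this summary can be cross-checked with a little arithmetic. The sketch below uses only numbers that appear in the summary itself; the helper names are hypothetical, and treating cost per token as system cost divided by throughput is a simplifying assumption.

```python
def active_fraction(active_params_b, total_params_b):
    """Share of parameters used per query (both arguments in billions)."""
    return active_params_b / total_params_b


def relative_cost_per_token(perf_gain, cost_multiplier):
    """Cost per token relative to the baseline system.

    Assumes cost per token scales as system cost divided by throughput.
    """
    return cost_multiplier / perf_gain


# GPT-OSS as quoted: 5 billion active out of 120 billion total.
gpt_oss = active_fraction(5, 120)

# Trillion-parameter model as quoted: 32 billion active out of 1,000 billion.
trillion_scale = active_fraction(32, 1000)

# GB200 NVL72 vs. 8-GPU Hopper as quoted: 15x performance at 1.5x the cost.
nvl72 = relative_cost_per_token(perf_gain=15, cost_multiplier=1.5)

# Benchmark cost as quoted: $200 down to $75.
benchmark = 75 / 200

print(f"GPT-OSS active share:        {gpt_oss:.1%}")
print(f"Trillion-scale active share: {trillion_scale:.1%}")
print(f"NVL72 cost per token:        {nvl72:.2f}x baseline")
print(f"Benchmark cost:              {benchmark:.1%} of original")
```

Both active shares fall at the low end of the 3-10% range, and 15x the throughput at 1.5x the cost works out to one tenth of the cost per token, consistent with the 10x reduction quoted.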
Notable Moment
Buck reveals that the NVLink interconnect behind these MoE models uses PAM4 signaling, which sends one of four signal levels (two bits) per symbol instead of a binary zero or one. This pushes the limits of physics, with millimeter wavelengths, to enable trillion-parameter models like QwenMax-2 that activate only 32 billion parameters per query.
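For readers unfamiliar with PAM4 (four-level pulse amplitude modulation): each symbol on the wire takes one of four amplitude levels and so carries two bits, where plain binary signaling carries one. The sketch below illustrates the idea only. The Gray-coded level mapping and the function names are assumptions for illustration, not details from the episode or from any NVLink specification.

```python
# Assumed Gray-coded mapping from bit pairs to four amplitude levels.
# Adjacent levels differ by one bit, so a small amplitude error flips one bit.
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}
LEVEL_TO_BITS = {level: bits for bits, level in PAM4_LEVELS.items()}


def pam4_encode(bits):
    """Pack a bit sequence into PAM4 symbols, two bits per symbol."""
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def pam4_decode(symbols):
    """Recover the original bits from a sequence of PAM4 symbols."""
    bits = []
    for symbol in symbols:
        bits.extend(LEVEL_TO_BITS[symbol])
    return bits


def binary_encode(bits):
    """Two-level signaling for comparison: one bit per symbol."""
    return [1 if bit else -1 for bit in bits]


message = [1, 0, 0, 1, 1, 1, 0, 0]
pam4_symbols = pam4_encode(message)
binary_symbols = binary_encode(message)

print(f"binary symbols: {len(binary_symbols)}, PAM4 symbols: {len(pam4_symbols)}")
```

At the same symbol rate, PAM4 moves twice as many bits as two-level signaling, at the price of smaller gaps between levels and therefore tighter noise margins.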