Lowering the Cost of Intelligence With NVIDIA's Ian Buck - Ep. 284
Episode · 38 min · Read time: 2 min
AI-Generated Summary
Key Takeaways
- ✓ MoE Cost Reduction: OpenAI's GPT-OSS model uses 120 billion total parameters but activates only about 5 billion per query, versus Llama's 405 billion fully active parameters, reducing benchmark costs from $200 to $75 while roughly doubling intelligence scores through selective expert activation.
- ✓ NVLink Communication Architecture: GB200 NVL72 connects 72 GPUs with non-blocking, terabytes-per-second bandwidth over copper links signaling at 200 gigabits per second, delivering a 15x performance improvement over 8-GPU Hopper systems while adding only 50% cost, a 10x reduction in token cost to 10 cents per million tokens.
- ✓ Expert Parallelization Strategy: Modern MoE models deploy 300-400 experts across multiple layers, with router networks directing each query to the 2-8 most relevant experts and combining their responses. The knowledge domains are not prescribed: training naturally clusters information into specialized pockets through data-exposure patterns rather than manual categorization.
- ✓ Extreme Co-Design Process: NVIDIA employs more software engineers than hardware engineers, optimizing end-to-end performance through kernel fusions and overlapping NVLink communication with compute; the team recently achieved a 2x performance gain on a customer model within two weeks, halving token costs through software optimization alone, with no hardware changes.
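The routing described in the takeaways above, a gating network selecting the top few experts per token and mixing their outputs, can be sketched in a few lines. This is an illustrative toy (random weights, plain matrices standing in for full expert FFNs), not the implementation of any model named in the episode; `moe_layer`, `top_k`, and the dimensions are all assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, router_w, top_k=2):
    """Route a token through its top-k experts and mix their outputs.

    x: (d,) token activation; experts: list of (d, d) weight matrices
    standing in for full expert FFNs; router_w: (d, n_experts) gating
    weights. Illustrative sketch only.
    """
    logits = x @ router_w                      # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                       # softmax over selected experts
    # Only top_k expert matmuls run; the other experts stay idle,
    # which is where the per-query cost saving comes from.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

d, n_experts = 16, 8
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))
y = moe_layer(rng.standard_normal(d), experts, router_w, top_k=2)
print(y.shape)  # (16,)
```

With 8 experts and top_k=2, only a quarter of the expert weights participate in any one token, which is the same selective-activation principle the takeaways describe at 120-billion-parameter scale.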
What It Covers
Ian Buck explains how the Mixture-of-Experts (MoE) architecture powers leading AI models by activating only 3-10% of a network's parameters per query, reducing token costs by 10x while increasing intelligence scores from 28 to 61.
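The 3-10% figure follows directly from the parameter counts quoted in the takeaways. A quick arithmetic check, using the summary's numbers as assumed inputs:

```python
# Arithmetic sketch using the figures quoted in this summary (assumed values).
total_params = 120e9       # GPT-OSS total parameters, per the summary
active_params = 5e9        # parameters activated per query, per the summary
dense_params = 405e9       # Llama's fully active parameter count, per the summary

active_fraction = active_params / total_params        # ~4.2%, inside the 3-10% range
compute_vs_dense = active_params / dense_params       # ~1.2% of the dense model's per-token matmuls
print(f"{active_fraction:.1%} of weights active per token")
print(f"~{compute_vs_dense:.1%} of the dense model's per-token compute")
```

The point of the check: activating ~5B of 120B parameters lands squarely in the 3-10% window Buck describes, and against a 405B dense model the per-token compute gap is even larger than the cost numbers alone suggest.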
Notable Moment
Buck reveals that NVLink's copper links use PAM4 signaling, which encodes two bits per symbol across four voltage levels instead of binary zero-one, pushing physics limits at millimeter wavelengths to serve trillion-parameter MoE models like QwenMax-2 that activate only 32 billion parameters per query.
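PAM4's bandwidth doubling is easy to see in code: four voltage levels mean each transmitted symbol carries a two-bit pair, so the same symbol rate moves twice the bits of binary NRZ. A minimal sketch (the level values and Gray-coded mapping are illustrative assumptions, not NVLink's actual electrical spec):

```python
# PAM4 sketch: four voltage levels, two bits per symbol (vs one for NRZ).
# Gray-coded mapping so adjacent levels differ by a single bit (assumed values).
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}

def encode_pam4(bits):
    """Pack a bit sequence into PAM4 symbols, two bits per symbol."""
    assert len(bits) % 2 == 0, "PAM4 consumes bits in pairs"
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

symbols = encode_pam4([1, 0, 0, 1, 1, 1])
print(symbols)       # 6 bits -> 3 symbols: [3, -1, 1]
print(len(symbols))  # half as many symbols as bits
```

Halving the symbol count per bit is what lets a 200-gigabit-per-second copper lane run at a symbol rate the physics of the wire can still support.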