NVIDIA AI Podcast

Lowering the Cost of Intelligence With NVIDIA's Ian Buck - Ep. 284

38 min episode · 2 min read

AI-Generated Summary

Key Takeaways

  • MoE Cost Reduction: OpenAI's GPT-OSS model uses 120 billion total parameters but activates only about 5 billion per query, versus Llama's 405 billion fully active parameters, reducing benchmark costs from $200 to $75 while doubling intelligence scores through selective expert activation.
  • NVLink Communication Architecture: GB200 NVL72 connects 72 GPUs with non-blocking terabytes-per-second bandwidth using copper wires at 200 gigabits per second, enabling 15x performance improvement over 8-GPU Hopper systems while adding only 50% cost, achieving 10x token cost reduction to 10 cents per million tokens.
  • Expert Parallelization Strategy: Modern MoE models deploy 300-400 experts across multiple layers, with router networks directing each query to 2-8 relevant experts simultaneously and combining their responses. No knowledge domains are prescribed: training naturally clusters information into specialized pockets through data exposure patterns rather than manual categorization.
  • Extreme Co-Design Process: NVIDIA's software engineers outnumber its hardware engineers, optimizing end-to-end performance through kernel fusions and overlapping NVLink communication with compute. They recently achieved a 2x performance gain on a customer model within two weeks, halving token costs through software optimization alone, with no hardware changes.
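The selective activation described above can be sketched as a toy top-k MoE router. This is an illustrative sketch only; the expert count, dimensions, and function names are made up, not taken from any model discussed in the episode:

```python
import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Toy Mixture-of-Experts layer: route one token to its top-k experts.

    x        : (d,) token activation
    experts  : list of (d, d) weight matrices, one per expert
    router_w : (num_experts, d) router weights
    k        : experts activated per token (2-8 in the models described above)
    """
    logits = router_w @ x                    # score every expert for this token
    top_k = np.argsort(logits)[-k:]          # keep only the k highest-scoring
    gates = np.exp(logits[top_k])
    gates /= gates.sum()                     # softmax over the chosen experts
    # Only these k experts run; every other expert's parameters stay idle.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
d, num_experts = 16, 64
experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]
router_w = rng.standard_normal((num_experts, d))
x = rng.standard_normal(d)

y = moe_forward(x, experts, router_w, k=2)
print(y.shape)  # (16,)
# Active fraction of expert parameters here: 2/64 ≈ 3%, in line with the
# 3-10% activation range the episode attributes to production MoE models.
```

In a real deployment the experts would be sharded across GPUs (expert parallelism), which is why the all-to-all NVLink bandwidth in the second takeaway matters: routed tokens must reach their experts and return every layer.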

What It Covers

Ian Buck explains how the Mixture-of-Experts (MoE) architecture powers leading AI models by activating only 3-10% of a neural network's parameters per query, reducing token costs by 10x while increasing intelligence scores from 28 to 61.


Notable Moment

Buck reveals that NVLink's copper links use PAM4 signaling, which transmits four amplitude levels per symbol (two bits) instead of binary on-off, pushing physics limits at millimeter wavelengths to enable trillion-parameter models like QwenMax-2 that activate only 32 billion parameters per query.
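PAM4's gain over binary (NRZ) signaling is simply log2 of the level count. A minimal encoder sketch, with a hypothetical helper name, illustrates the two-bits-per-symbol packing:

```python
from math import log2

LEVELS = 4                               # PAM4: four amplitude levels per symbol
BITS_PER_SYMBOL = int(log2(LEVELS))      # = 2, vs 1 for binary NRZ

def pam4_encode(bits: str) -> list[int]:
    """Pack a bit string into PAM4 symbols (levels 0-3), two bits per symbol."""
    assert len(bits) % 2 == 0, "PAM4 consumes bits in pairs"
    return [int(bits[i:i + 2], 2) for i in range(0, len(bits), 2)]

print(BITS_PER_SYMBOL)           # 2
print(pam4_encode("11011000"))   # [3, 1, 2, 0]
```

Doubling bits per symbol is how a 200-gigabit-per-second lane fits on a copper wire without doubling the symbol rate, at the cost of tighter voltage margins between the four levels.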
