Eye on AI

#324 Sharon Zhou: Inside AMD's Plan to Build Self-Improving AI

Episode: 46 min · Read time: 2 min

Topics: Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Catastrophic Forgetting Prevention: Even when the full original pre-training corpus is unavailable, reintroducing as little as 1% of pre-training data during post-training significantly reduces catastrophic forgetting, letting the model reconnect with its earlier representations. Developers running heavy post-training workloads should monitor for this actively, since even small fine-tuning tasks can compound into larger degradation over time. (A minimal data-mixing sketch follows this list.)
  • Kernel Optimization Economics: Speeding up a single matrix multiplication kernel, which executes billions or trillions of times inside one model, can translate to hundreds of billions of dollars in savings at frontier scale. Even for smaller deployments, a 10x kernel speed improvement can be economically equivalent to purchasing up to 10x more GPU compute hardware. (See the back-of-the-envelope calculation after this list.)
  • Verifiable Rewards for RL Training: AMD's kernel generation pipeline uses GPU profiler output as a verifiable reward signal for reinforcement learning, much as math correctness served as a verifiable reward when training reasoning models like those behind ChatGPT. Because profiler speed metrics are objective, they feed directly back into post-training loops without requiring costly human preference labeling. (A toy reward-function sketch follows this list.)
  • AMD's Open-Source Advantage: AMD's ROCm software stack is open-source, unlike NVIDIA's CUDA. This means language models can train directly on ROCm documentation and code, enabling AI agents to learn AMD-specific kernel writing more effectively. Developers building on AMD hardware benefit from this transparency when using AI coding tools like Cursor to generate optimized kernels.
  • Kernel Engineering Skill Gap: Writing optimized GPU kernels requires simultaneous expertise in GPU architecture specifics and model mathematics — a combination rare enough to bottleneck even frontier labs. A complex kernel can take a non-expert months and an expert several weeks to write manually, making AI-assisted kernel generation a high-leverage productivity multiplier across the entire AI development stack.
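
To make the data-mixing idea from the first takeaway concrete, here is a minimal sketch of sampling roughly 1% of each fine-tuning batch from a pre-training pool. The dataset names, batch size, and mixing ratio are illustrative assumptions, not AMD's or Zhou's actual recipe; in practice the pools would be tokenized corpora feeding a real training loop.

```python
import random

# Stand-in data: in practice these would be tokenized corpora.
pretrain_pool = [f"pretrain_doc_{i}" for i in range(10_000)]  # slice of pre-training data
finetune_pool = [f"task_example_{i}" for i in range(2_000)]   # task-specific data

REPLAY_FRACTION = 0.01   # reintroduce ~1% pre-training data per batch
BATCH_SIZE = 128

def make_mixed_batch():
    """Sample a fine-tuning batch with a small replay slice of pre-training data."""
    n_replay = max(1, round(BATCH_SIZE * REPLAY_FRACTION))
    batch = random.sample(finetune_pool, BATCH_SIZE - n_replay)
    batch += random.sample(pretrain_pool, n_replay)
    random.shuffle(batch)
    return batch

if __name__ == "__main__":
    batch = make_mixed_batch()
    n_replay = sum(1 for x in batch if x.startswith("pretrain"))
    print(f"{n_replay}/{len(batch)} examples in this batch are replayed pre-training data")
```

The key design point is that the replay slice is drawn every batch, so the model keeps seeing traces of its earlier distribution throughout post-training rather than only at the start.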
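
The "10x speedup ≈ 10x hardware" equivalence from the second takeaway holds in the limit where the kernel dominates total runtime; Amdahl's law gives the general relationship. The numbers below (kernel share of runtime, fleet cost) are made up purely for illustration.

```python
def effective_fleet_multiplier(kernel_share: float, kernel_speedup: float) -> float:
    """Amdahl's-law view: overall throughput gain from speeding up one kernel.

    kernel_share   -- fraction of total runtime spent in the kernel (0..1)
    kernel_speedup -- how much faster the kernel itself becomes (e.g. 10.0)
    """
    new_runtime = (1 - kernel_share) + kernel_share / kernel_speedup
    return 1 / new_runtime

# Illustrative numbers only: a matmul kernel consuming 80% of runtime,
# made 10x faster, on a hypothetical $1B GPU fleet.
gain = effective_fleet_multiplier(kernel_share=0.8, kernel_speedup=10.0)
fleet_cost = 1e9
print(f"Throughput gain: {gain:.2f}x "
      f"(equivalent to ~${(gain - 1) * fleet_cost / 1e9:.2f}B of additional hardware)")
```

With these assumed numbers the gain is about 3.6x; as the kernel's share of runtime approaches 100%, the gain approaches the full 10x cited in the takeaway.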
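
To illustrate the verifiable-reward idea from the third takeaway, here is a toy Python stand-in: "kernels" are plain callables, a timing helper plays the role of the GPU profiler, and the reward is the measured speedup gated by a correctness check. This is not AMD's pipeline; the function names and reward shaping are assumptions chosen only to show that a measured speedup is an objective scalar requiring no human labels.

```python
import time

def profile(fn, *args, repeats: int = 10) -> float:
    """Approximate median wall-clock time of fn(*args) -- a stand-in for a GPU profiler."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

def kernel_reward(candidate, baseline, test_input) -> float:
    """Verifiable reward: 0 if outputs differ, else measured speedup over baseline."""
    if candidate(test_input) != baseline(test_input):
        return 0.0  # correctness is a hard gate, like a failed unit test
    return profile(baseline, test_input) / profile(candidate, test_input)

# Toy example: two implementations of summing squares.
baseline = lambda xs: sum(x * x for x in xs)
candidate = lambda xs: sum(map(lambda x: x * x, xs))

print(f"reward = {kernel_reward(candidate, baseline, list(range(10_000))):.2f}")
```

In an RL loop, this scalar would score each generated kernel candidate, so faster-but-correct generations are directly reinforced.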

What It Covers

Sharon Zhou, VP of AI at AMD and Stanford PhD graduate, explains how AMD uses AI agents and reinforcement learning to autonomously generate and optimize low-level GPU kernel code, enabling language models to run faster on AMD hardware while reducing the rare human expertise bottleneck in kernel engineering.

Notable Moment

When asked whether faster kernel optimization might ease pressure from the global chip shortage, Zhou flatly dismissed the idea, stating that demand for compute is effectively infinite and that no organization has reached a point where efficiency gains reduce its appetite for more hardware.
