Eye on AI

#324 Sharon Zhou: Inside AMD's Plan to Build Self-Improving AI

Episode: 46 min · Read time: 2 min

Topics: Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Catastrophic Forgetting Prevention: Even when the full original pre-training corpus is unavailable, reintroducing as little as 1% of pre-training data during post-training significantly reduces catastrophic forgetting, letting the model reconnect with its earlier representations. Developers running heavy post-training workloads should monitor for this actively, since even small fine-tuning tasks can compound into larger degradation over time. (A minimal data-mixing sketch follows this list.)
  • Kernel Optimization Economics: Speeding up a single matrix multiplication kernel, which executes billions or trillions of times inside one model, can translate to hundreds of billions of dollars in savings at frontier scale. Even for smaller deployments, a 10x kernel speed improvement can be economically equivalent to purchasing up to 10x more GPU compute hardware. (See the back-of-the-envelope calculation after this list.)
  • Verifiable Rewards for RL Training: AMD's kernel generation pipeline uses GPU profiler output as a verifiable reward signal for reinforcement learning, much as math correctness served as a verifiable reward when training reasoning models like those behind ChatGPT. Because profiler speed metrics are objective, they feed directly back into post-training loops without requiring costly human preference labeling. (A toy reward-function sketch follows this list.)
  • AMD's Open-Source Advantage: AMD's ROCm software stack is open-source, unlike NVIDIA's CUDA. This means language models can train directly on ROCm documentation and code, enabling AI agents to learn AMD-specific kernel writing more effectively. Developers building on AMD hardware benefit from this transparency when using AI coding tools like Cursor to generate optimized kernels.
  • Kernel Engineering Skill Gap: Writing optimized GPU kernels requires simultaneous expertise in GPU architecture specifics and model mathematics — a combination rare enough to bottleneck even frontier labs. A complex kernel can take a non-expert months and an expert several weeks to write manually, making AI-assisted kernel generation a high-leverage productivity multiplier across the entire AI development stack.
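
To make the data-mixing idea from the first takeaway concrete, here is a minimal sketch of sampling roughly 1% of each fine-tuning batch from a pre-training pool. The dataset names, batch size, and mixing ratio are illustrative assumptions, not AMD's or Zhou's actual recipe; in practice the pools would be tokenized corpora feeding a real training loop.

```python
import random

# Stand-in data: in practice these would be tokenized corpora.
pretrain_pool = [f"pretrain_doc_{i}" for i in range(10_000)]  # slice of pre-training data
finetune_pool = [f"task_example_{i}" for i in range(2_000)]   # task-specific data

REPLAY_FRACTION = 0.01   # reintroduce ~1% pre-training data per batch
BATCH_SIZE = 128

def make_mixed_batch():
    """Sample a fine-tuning batch with a small replay slice of pre-training data."""
    n_replay = max(1, round(BATCH_SIZE * REPLAY_FRACTION))
    batch = random.sample(finetune_pool, BATCH_SIZE - n_replay)
    batch += random.sample(pretrain_pool, n_replay)
    random.shuffle(batch)
    return batch

if __name__ == "__main__":
    batch = make_mixed_batch()
    n_replay = sum(1 for x in batch if x.startswith("pretrain"))
    print(f"{n_replay}/{len(batch)} examples in this batch are replayed pre-training data")
```

The key design point is that the replay slice is drawn every batch, so the model keeps seeing traces of its earlier distribution throughout post-training rather than only at the start.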
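
The "10x speedup ≈ 10x hardware" equivalence from the second takeaway holds in the limit where the kernel dominates total runtime; Amdahl's law gives the general relationship. The numbers below (kernel share of runtime, fleet cost) are made up purely for illustration.

```python
def effective_fleet_multiplier(kernel_share: float, kernel_speedup: float) -> float:
    """Amdahl's-law view: overall throughput gain from speeding up one kernel.

    kernel_share   -- fraction of total runtime spent in the kernel (0..1)
    kernel_speedup -- how much faster the kernel itself becomes (e.g. 10.0)
    """
    new_runtime = (1 - kernel_share) + kernel_share / kernel_speedup
    return 1 / new_runtime

# Illustrative numbers only: a matmul kernel consuming 80% of runtime,
# made 10x faster, on a hypothetical $1B GPU fleet.
gain = effective_fleet_multiplier(kernel_share=0.8, kernel_speedup=10.0)
fleet_cost = 1e9
print(f"Throughput gain: {gain:.2f}x "
      f"(equivalent to ~${(gain - 1) * fleet_cost / 1e9:.2f}B of additional hardware)")
```

With these assumed numbers the gain is about 3.6x; as the kernel's share of runtime approaches 100%, the gain approaches the full 10x cited in the takeaway.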
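
To illustrate the verifiable-reward idea from the third takeaway, here is a toy Python stand-in: "kernels" are plain callables, a timing helper plays the role of the GPU profiler, and the reward is the measured speedup gated by a correctness check. This is not AMD's pipeline; the function names and reward shaping are assumptions chosen only to show that a measured speedup is an objective scalar requiring no human labels.

```python
import time

def profile(fn, *args, repeats: int = 10) -> float:
    """Approximate median wall-clock time of fn(*args) -- a stand-in for a GPU profiler."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

def kernel_reward(candidate, baseline, test_input) -> float:
    """Verifiable reward: 0 if outputs differ, else measured speedup over baseline."""
    if candidate(test_input) != baseline(test_input):
        return 0.0  # correctness is a hard gate, like a failed unit test
    return profile(baseline, test_input) / profile(candidate, test_input)

# Toy example: two implementations of summing squares.
baseline = lambda xs: sum(x * x for x in xs)
candidate = lambda xs: sum(map(lambda x: x * x, xs))

print(f"reward = {kernel_reward(candidate, baseline, list(range(10_000))):.2f}")
```

In an RL loop, this scalar would score each generated kernel candidate, so faster-but-correct generations are directly reinforced.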

What It Covers

Sharon Zhou, VP of AI at AMD and Stanford PhD graduate, explains how AMD uses AI agents and reinforcement learning to autonomously generate and optimize low-level GPU kernel code, enabling language models to run faster on AMD hardware while reducing the rare human expertise bottleneck in kernel engineering.

Notable Moment

When asked whether faster kernel optimization might ease pressure from the global chip shortage, Zhou flatly dismissed the idea, stating that demand for compute is effectively infinite and that no organization has reached a point where efficiency gains reduce its appetite for more hardware.
