The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

February 5, 2026

68 min episode · 2 min read

Mark Bissell, Myra Deng


Topics

Artificial Intelligence

AI-Generated Summary

Published Feb 6, 2026

Key Takeaways

✓Production Interpretability at Scale: Goodfire runs real-time steering on trillion-parameter models like Qwen QwQ, controlling specific behaviors by manipulating individual features. Their fork of the SGLang codebase supports live activation steering during inference, showing that interpretability techniques scale beyond toy models to frontier systems that need full H100 nodes to deploy.
✓SAE Limitations in Practice: Sparse autoencoders (SAEs) underperform probes trained on raw activations at detecting harmful behavior, hallucinations, and PII in production. SAEs do well when the data is noisy or synthetic and generalization matters, but when clean labeled data exists, supervised probes trained directly on activations score better on downstream metrics. Unsupervised methods therefore suit specific use cases rather than every one.
✓Steering-Prompting Equivalence: Research from Ekdeep Singh establishes a formal mathematical equivalence between activation steering and in-context learning. The framework predicts the exact steering magnitude needed to replicate a prompting effect, including jailbreaks built from many-shot examples. The two inference-time interventions can therefore be converted into one another and treated as interchangeable tools for model control.
✓Post-Training Surgical Edits: Interpretability enables targeted removal of unintended behaviors such as political bias or reward hacking without full retraining. Goodfire frames this as a move beyond crude reinforcement learning, which supplies only a reward signal, toward expert feedback that surgically modifies specific model representations. The approach targets problems like GPT-4o's sycophancy through precise internal adjustments.
✓Healthcare Biomarker Discovery: A partnership with Mayo Clinic, Arc Institute, and Prima Mente applies interpretability to genomics foundation models to identify novel Alzheimer's disease biomarkers. The technique extracts superhuman knowledge from narrow AI systems trained on medical imaging and genomic data, showing that interpretability works as a scientific discovery tool and not only for debugging or safety.
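
The activation steering mentioned in the takeaways can be illustrated with a toy example. This is a minimal sketch, assuming a two-layer NumPy network and a randomly chosen feature direction; the names `forward`, `feature_direction`, and `alpha` are hypothetical, and nothing here reflects Goodfire's SGLang fork or any production model. It only shows the core idea: add a scaled direction to the hidden activations at inference time and observe that the output changes.

```python
import numpy as np

def forward(x, w_in, w_out, steer=None, alpha=0.0):
    """Toy two-layer network with an optional steering intervention."""
    hidden = np.tanh(x @ w_in)  # hidden activations, shape (batch, d_hidden)
    if steer is not None:
        unit = steer / np.linalg.norm(steer)
        hidden = hidden + alpha * unit  # shift every example along the direction
    return hidden, hidden @ w_out

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
w_in = rng.normal(size=(d_in, d_hidden))
w_out = rng.normal(size=(d_hidden, d_out))
x = rng.normal(size=(3, d_in))

# Hypothetical feature direction; in practice it might come from an SAE
# decoder row or a probe weight vector (an assumption, not from the episode).
feature_direction = rng.normal(size=d_hidden)

base_hidden, base_logits = forward(x, w_in, w_out)
steered_hidden, steered_logits = forward(
    x, w_in, w_out, steer=feature_direction, alpha=2.0
)

# Measure how far each example moved along the steering direction.
unit = feature_direction / np.linalg.norm(feature_direction)
shift = (steered_hidden - base_hidden) @ unit
```

In this toy setup `alpha` is the steering magnitude: each hidden state moves exactly `alpha` units along the chosen direction, and the logits change as a result. Real systems apply the same kind of shift inside a transformer layer during decoding, which is where the engineering effort lies.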

What It Covers

Goodfire AI announces a $150M Series B at a $1.25B valuation, positioning itself as the first mechanistic interpretability frontier lab applying research to production. Mark Bissell and Myra Deng discuss using sparse autoencoders and probes to understand model internals, steer behaviors surgically, and solve real-world problems ranging from PII detection at Rakuten to Alzheimer's biomarker discovery.

Key Questions Answered

•Rakuten PII Detection System: A deployed token-level PII classifier, built from probes on language model activations, processes all user queries daily. The system handles synthetic-to-real transfer, multilingual input in English and Japanese, and precise scrubbing that keeps private data from reaching downstream providers. It shows production-grade interpretability solving a compliance problem that traditional guardrail models cannot address efficiently.
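
A token-level probe of the kind described above can be sketched as a logistic-regression classifier over per-token activations. This is a rough illustration under stated assumptions, not the Rakuten or Goodfire system: the activations are synthetic Gaussian vectors, the "PII direction" is invented, and the helpers `make_tokens`, `train_probe`, `flag_tokens`, and `scrub` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_tokens(rng, n, direction):
    """Synthetic per-token activations; PII tokens are shifted along `direction`."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, direction.size)) + 4.0 * labels[:, None] * direction
    return acts, labels

def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    weights = np.zeros(acts.shape[1])
    bias = 0.0
    for _ in range(steps):
        error = sigmoid(acts @ weights + bias) - labels
        weights -= lr * acts.T @ error / len(labels)
        bias -= lr * error.mean()
    return weights, bias

def flag_tokens(acts, weights, bias, threshold=0.5):
    """Return a boolean mask marking tokens the probe classifies as PII."""
    return sigmoid(acts @ weights + bias) >= threshold

def scrub(tokens, flags, mask="[REDACTED]"):
    """Replace flagged tokens before text is passed to any downstream provider."""
    return [mask if flagged else token for token, flagged in zip(tokens, flags)]

rng = np.random.default_rng(0)
direction = rng.normal(size=32)
direction /= np.linalg.norm(direction)  # assumed unit-length "PII direction"

train_acts, train_labels = make_tokens(rng, 400, direction)
test_acts, test_labels = make_tokens(rng, 200, direction)

weights, bias = train_probe(train_acts, train_labels)
predictions = flag_tokens(test_acts, weights, bias)
accuracy = (predictions == test_labels.astype(bool)).mean()

example = scrub(["call", "me", "at", "555-0100"], [False, False, False, True])
```

The probe is a single weight vector plus a bias, so scoring a token costs one dot product on activations the model has already computed. That low cost is one plausible reason a probe can run on every query where a separate guardrail model would be expensive, though the summary does not give figures.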

Notable Moment

The team admitted that their live demonstration exposed the engineering challenges of working at scale. The hastily assembled demo of steering a trillion-parameter model required custom infrastructure and was fragile behind the scenes, which shows that production interpretability means solving both novel research problems and substantial systems engineering problems that academic toy models never encounter.
