Latent Space

🔬ESMFold2: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

May 27, 2026

70 min episode · 3 min read

Alex Rives

Topics

Startups, Fundraising & VC, Design & UX

AI-Generated Summary

Published May 28, 2026

Key Takeaways

✓ Metagenomic data eliminates the scaling bottleneck: ESM2 showed diminishing returns because it was trained only on UniRef sequences. Adding metagenomic data (sequences collected from hydrothermal vents, deep oceans, soil, and gut environments) restored clean scaling laws for ESMC. The lesson: the biological diversity of the training data matters more than parameter count alone. Researchers building bio-foundation models should prioritize sourcing sequences from extreme and underrepresented ecological niches before scaling compute.
✓ Sparse autoencoders reveal an emergent biological hierarchy: Training sparse autoencoders (SAEs) across all layers of ESMC's 300M-, 600M-, and 6B-parameter models reveals a feature hierarchy that matches decades of experimental biology, from basic biochemical properties up to abstract functional themes, without any prior biological knowledge encoded. Teams doing mechanistic interpretability on biology models should apply SAEs layer by layer to surface the latent biological variables the model uses for sequence prediction.
✓ Antibody design without MSAs reaches therapeutic affinity: ESMC designs single-chain variable fragments (scFvs) that reach the binding affinities required for therapeutic function, without using multiple sequence alignments (MSAs). Antibodies evolve toward diversity rather than conservation, which puts MSA-based approaches at a structural disadvantage. Protein engineers targeting such therapeutic modalities, which represent roughly 25% of new drugs, should evaluate world-model search approaches over MSA-dependent pipelines for antibody CDR design.
✓ World-model search replaces explicit programming for protein design: Rather than encoding biological rules or structural priors, ESMC treats protein design as a search problem over a predictive world model. Mini-protein binders and scFvs emerge from searching the model's learned representation space against design criteria. Computational biology teams can put this into practice by using ESMC's MIT-licensed weights to run generative searches rather than building a task-specific supervised model for each design objective.
✓ Atlas of 1.1 billion predicted structures enables cross-evolution linkage: Biohub clustered 6.8 billion sequences at 70% sequence identity, producing ~1.2 billion clusters with predicted structures. Computing features across all clusters surfaces connections between distantly related proteins, such as gene editing systems with no sequence similarity but shared structural motifs. Researchers mining for novel enzymes or gene editors should query this atlas using feature-space proximity rather than sequence-based BLAST searches.
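
The atlas takeaway recommends querying by feature-space proximity rather than by sequence similarity. As a rough illustration only, the sketch below ranks entries by cosine similarity between feature vectors. The cluster names, the three-dimensional vectors, and the functions `cosine_similarity` and `nearest` are all hypothetical; the summary does not describe the atlas's actual query interface or feature dimensionality.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: List[float],
            atlas: Dict[str, List[float]],
            k: int) -> List[Tuple[str, float]]:
    """Return the k atlas entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in atlas.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Made-up feature vectors for three hypothetical clusters.
    toy_atlas = {
        "cluster_a": [1.0, 0.0, 0.0],
        "cluster_b": [0.9, 0.1, 0.0],
        "cluster_c": [0.0, 1.0, 0.0],
    }
    for name, score in nearest([1.0, 0.2, 0.0], toy_atlas, k=2):
        print(name, round(score, 3))
```

Because the ranking depends only on the vectors, two entries with no detectable sequence similarity can still come out as neighbors, which is the property the takeaway relies on. A brute-force scan like this would not be practical at the scale of a billion entries; an approximate nearest-neighbor index would be the usual substitute.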

What It Covers

Alex Rives, Head of Science at Biohub, presents ESM Cambrian (ESMC), a 6-billion-parameter protein language model trained on 6.8 billion non-redundant protein sequences. The model predicts protein structure, enables antibody design, and, when probed with sparse autoencoders, reveals emergent biological features, all without multiple sequence alignments or hand-engineered priors.
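
For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the forward pass and loss of a generic SAE applied to one activation vector. It assumes the common ReLU-encoder formulation with an L1 sparsity penalty; the summary does not say which SAE variant was used on ESMC, and the tiny hand-picked weights here are illustrative, not learned. In practice the weights would be trained by gradient descent on activations collected from a given model layer.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_forward(x: Vector, w_enc: Matrix, b_enc: Vector,
                w_dec: Matrix, b_dec: Vector) -> Tuple[Vector, Vector]:
    """Encode an activation into non-negative features, then reconstruct it.

    features = ReLU(W_enc @ (x - b_dec) + b_enc)
    recon    = W_dec @ features + b_dec
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre]
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x: Vector, features: Vector, recon: Vector, l1_coeff: float) -> float:
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    return mse + l1_coeff * sum(abs(f) for f in features)


if __name__ == "__main__":
    # A 2-dimensional "activation" expanded into 3 features (hand-picked weights).
    x = [1.0, -2.0]
    w_enc = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    b_enc = [0.0, 0.0, 0.0]
    w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
    b_dec = [0.0, 0.0]
    features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
    print(features, recon, sae_loss(x, features, recon, l1_coeff=0.1))
```

The feature layer is wider than the input, and the L1 term pushes most features to zero for any given input. That sparsity is what makes individual features candidates for human-interpretable labels.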

Key Questions Answered

•Virtual Biology Initiative targets cellular-scale data generation: Biohub commits $400M internally and $100M externally to generate cellular biology data at scale, prioritizing perturbation biology, spatial transcriptomics, and multi-modal single-cell measurements. Current cell atlases contain roughly one billion cells; the initiative targets multiple orders of magnitude beyond that. The core design principle mirrors protein modeling: expose the model to interventions across as many cellular contexts as possible to enable generalization to unobserved experiments.
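
The world-model search framing described in this summary (design as search against a predictive model rather than hand-coded rules) can be illustrated with a toy hill-climbing loop. This is a sketch under loud assumptions and not ESMC's actual procedure: `toy_score` is a hypothetical stand-in for a model-derived design score and simply counts matches to an arbitrary target string, and `propose_mutation`, `search`, and every parameter value are invented for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(seq: str, target: str) -> int:
    """Hypothetical stand-in for a learned model's design score.

    A real pipeline would query a protein language model here; this toy
    version only counts positions that match an arbitrary target string.
    """
    return sum(a == b for a, b in zip(seq, target))


def propose_mutation(seq: str, rng: random.Random) -> str:
    """Return a copy of seq with one randomly chosen position resampled."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def search(start: str, target: str, steps: int, seed: int = 0) -> str:
    """Greedy search: accept a mutation only if it does not lower the score."""
    rng = random.Random(seed)
    best, best_score = start, toy_score(start, target)
    for _ in range(steps):
        candidate = propose_mutation(best, rng)
        cand_score = toy_score(candidate, target)
        if cand_score >= best_score:
            best, best_score = candidate, cand_score
    return best


if __name__ == "__main__":
    target = "MKTAYIAKQR"  # arbitrary illustrative string, not a real design goal
    start = "A" * len(target)
    result = search(start, target, steps=2000)
    print(result, toy_score(result, target))
```

The point of the structure is that only the scoring function encodes any knowledge. Swapping in a different objective changes what the search finds without changing the loop, which mirrors the contrast the summary draws with building a separate supervised model per design objective.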

Notable Moment

Rives notes that ESM2 appeared to hit diminishing returns on scaling, which could have ended the research direction entirely. The fix turned out to be data composition, not architecture: adding metagenomic sequences restored a clean, predictable scaling law, validating the bitter lesson for protein biology years after the initial bet.
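
A "clean" scaling law typically means that loss falls as a power law in scale, which shows up as a straight line on log-log axes. The sketch below shows one common way to check this: an ordinary least-squares fit in log space. The data points are synthetic, generated from an exact power law chosen for illustration; they are not ESM2 or ESMC measurements, and `fit_power_law` is a hypothetical helper name.

```python
import math
from typing import List, Tuple


def fit_power_law(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y).

    Returns (a, b). All xs and ys must be positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept mapped back from log space
    return a, b


if __name__ == "__main__":
    # Synthetic example: loss = 5 * scale**-0.1, with no noise.
    scales = [1e6, 1e7, 1e8, 1e9]
    losses = [5.0 * s ** -0.1 for s in scales]
    a, b = fit_power_law(scales, losses)
    print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements, diminishing returns of the kind attributed to ESM2 would appear as points bending away from the fitted line at the largest scales.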

Tools mentioned in this episode

UniRef
ESM Cambrian (ESMC) (mentioned by guest)
ESM2 (mentioned by guest)
BLAST
