🔬ESMFold2: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub
Episode: 70 min · Read time: 3 min
AI-Generated Summary
Key Takeaways
- ✓ Metagenomic data eliminates the scaling bottleneck: ESM2 showed diminishing returns because it was trained only on UniRef sequences. Adding metagenomic data (sequences collected from hydrothermal vents, deep oceans, soil, and gut environments) restored clean scaling laws for ESMC. The lesson: the biological diversity of the training data matters more than parameter count alone. Researchers building bio-foundation models should prioritize sourcing sequences from extreme and underrepresented ecological niches before scaling compute.
- ✓ Sparse autoencoders reveal emergent biological hierarchy: Training sparse autoencoders (SAEs) across all layers of ESMC's 300M, 600M, and 6B parameter models reveals a feature hierarchy matching decades of experimental biology, from basic biochemical properties up to abstract functional themes, without any prior biological knowledge encoded. Teams doing mechanistic interpretability on biology models should apply SAEs layer-by-layer to surface the latent biological variables the model uses for sequence prediction.
- ✓ Antibody design without MSAs reaches therapeutic affinity: ESMC designs single-chain variable fragments (scFvs) that reach the binding affinity required for therapeutic function, without using multiple sequence alignments (MSAs). Antibodies evolve toward diversity rather than conservation, which puts MSA-based approaches at a structural disadvantage. Protein engineers working on antibody therapeutics, a modality that represents roughly 25% of new drugs, should evaluate world-model search approaches over MSA-dependent pipelines for antibody CDR design.
- ✓ World-model search replaces explicit programming for protein design: Rather than encoding biological rules or structural priors, ESMC treats protein design as a search problem over a predictive world model. Mini-protein binders and scFvs emerge from searching the model's learned representation space against design criteria. Computational biology teams can operationalize this by using ESMC's MIT-licensed weights to run generative searches rather than building a task-specific supervised model for each design objective.
- ✓ Atlas of 1.1 billion predicted structures enables cross-evolution linkage: Biohub clustered 6.8 billion sequences at 70% sequence identity, producing ~1.2 billion clusters with predicted structures. Computing features across all clusters surfaces connections between distantly related proteins, such as gene editing systems with no sequence similarity but shared structural motifs. Researchers mining for novel enzymes or gene editors should query this atlas using feature-space proximity rather than sequence-based BLAST searches.
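The feature-space query in the last takeaway can be illustrated with a small sketch. It assumes each cluster has a fixed-length feature vector and ranks clusters by cosine similarity to a query. The array `cluster_features`, the helper `cosine_top_k`, and all sizes are hypothetical; the data is random and stands in for real atlas features, and nothing here reflects the atlas's actual interface.

```python
import numpy as np

def cosine_top_k(query, features, k=3):
    """Return indices and cosine similarities of the k rows of `features` closest to `query`."""
    q = query / np.linalg.norm(query)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ q                      # cosine similarity of every row to the query
    top = np.argsort(-sims)[:k]       # indices of the k largest similarities
    return top, sims[top]

# Synthetic stand-in for per-cluster feature vectors (hypothetical: 1000 clusters, 64 dims).
rng = np.random.default_rng(0)
cluster_features = rng.normal(size=(1000, 64))

# Plant a "distant relative": cluster 7 gets nearly the same features as cluster 42.
cluster_features[7] = cluster_features[42] + 0.01 * rng.normal(size=64)

idx, sims = cosine_top_k(cluster_features[42], cluster_features, k=3)
print(idx, sims)
```

At the scale described in the summary, an exhaustive scan like this would likely be replaced by an approximate nearest-neighbor index; the ranking idea stays the same.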
What It Covers
Alex Rives, Head of Science at Biohub, presents ESM Cambrian (ESMC), a 6-billion-parameter protein language model trained on 6.8 billion non-redundant protein sequences. The model predicts protein structure and enables antibody design, and sparse autoencoders trained on it reveal emergent biological features, all without multiple sequence alignments or hand-engineered priors.
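For readers unfamiliar with sparse autoencoders, here is a minimal sketch of the forward pass and loss of one common formulation: a ReLU encoder, a linear decoder, and an L1 penalty on the codes. It is not the setup used for ESMC, which the summary does not describe. The sizes, the `l1_coeff` value, and the random `acts` array (a stand-in for one layer's activations) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 32, 128   # hypothetical: activation width and dictionary size

# Randomly initialised parameters; a real SAE would learn these by gradient descent.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative codes, then reconstruct them."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps codes >= 0
    x_hat = z @ W_dec + b_dec                # linear decoder
    return z, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes codes toward sparsity."""
    z, x_hat = sae_forward(x)
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity, recon, sparsity

acts = rng.normal(size=(16, d_model))        # synthetic stand-in for model activations
total, recon, sparsity = sae_loss(acts)
z, x_hat = sae_forward(acts)
print(total, recon, sparsity)
```

After training, each column of the code `z` is a candidate feature; interpretability work then inspects which sequences or residues activate it most strongly.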
Key Questions Answered
- • Virtual Biology Initiative targets cellular-scale data generation: Biohub commits $400M internally and $100M externally to generate cellular biology data at scale, prioritizing perturbation biology, spatial transcriptomics, and multi-modal single-cell measurements. Current cell atlases contain roughly one billion cells; the initiative targets multiple orders of magnitude beyond that. The core design principle mirrors protein modeling: expose the model to interventions across as many cellular contexts as possible to enable generalization to unobserved experiments.
Notable Moment
Rives notes that ESM2 appeared to hit diminishing returns on scaling — which could have ended the research direction entirely. The fix turned out to be data composition, not architecture. Adding metagenomic sequences restored a clean, predictable scaling law, validating the bitter lesson for protein biology years after the initial bet.
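One common way to check whether a scaling law is "clean" is to fit a power law, loss = a · N^b, as a straight line in log-log space. The sketch below does this for the three model sizes mentioned in the summary. The loss values are synthetic, generated from an assumed power law (`true_a` and `true_b` are made-up numbers), and are not results from the episode.

```python
import numpy as np

# Model sizes mentioned in the summary: 300M, 600M, 6B parameters.
params = np.array([3e8, 6e8, 6e9])

# SYNTHETIC losses from an assumed power law; NOT measurements from ESMC.
true_a, true_b = 5.0, -0.05
loss = true_a * params ** true_b

# A power law is linear in log-log space: log(loss) = log(a) + b * log(N).
slope, intercept = np.polyfit(np.log(params), np.log(loss), deg=1)
fit_a, fit_b = np.exp(intercept), slope

# Extrapolate to a hypothetical larger model to show how such a fit would be used.
predicted_60b = fit_a * 6e10 ** fit_b
print(fit_a, fit_b, predicted_60b)
```

With real measurements, points that fall off the fitted line at larger sizes would be the kind of diminishing returns described for ESM2, while points that stay on the line would match the restored scaling described for ESMC.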