Latent Space

🔬ESMFold2: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

70 min episode · 3 min read

AI-Generated Summary

Key Takeaways

  • Metagenomic data eliminates scaling bottleneck: ESM2 showed diminishing returns because it trained only on UniRef sequences. Adding metagenomic data — sequences collected from hydrothermal vents, deep oceans, soil, and gut environments — restored clean scaling laws for ESMC. The lesson: biological diversity of training data matters more than parameter count alone. Researchers building bio-foundation models should prioritize sourcing sequences from extreme and underrepresented ecological niches before scaling compute.
  • Sparse autoencoders reveal emergent biological hierarchy: Training sparse autoencoders across all layers of ESMC's 300M, 600M, and 6B parameter models reveals a feature hierarchy matching decades of experimental biology — from basic biochemical properties up to abstract functional themes — without any prior biological knowledge encoded. Teams doing mechanistic interpretability on biology models should apply SAEs layer-by-layer to surface latent biological variables the model uses for sequence prediction.
  • Antibody design without MSAs reaches therapeutic affinity: ESMC designs single-chain variable fragments (scFvs) that reach the binding affinity required for therapeutic function, without using multiple sequence alignments (MSAs). Antibodies evolve toward diversity rather than conservation, which puts MSA-based approaches at an inherent disadvantage. Protein engineers targeting therapeutic modalities, which represent roughly 25% of new drugs, should evaluate world-model search approaches over MSA-dependent pipelines for antibody CDR design.
  • World-model search replaces explicit programming for protein design: Rather than encoding biological rules or structural priors, ESMC treats protein design as a search problem over a predictive world model. Mini-protein binders and scFvs emerge from searching the model's learned representation space against design criteria. Computational biology teams can apply this by using ESMC's MIT-licensed weights to run generative searches rather than building a task-specific supervised model for each design objective.
  • Atlas of 1.1 billion predicted structures enables cross-evolution linkage: Biohub clustered 6.8 billion sequences at 70% sequence identity, producing ~1.2 billion clusters with predicted structures. Computing features across all clusters surfaces connections between distantly related proteins — such as gene editing systems with no sequence similarity but shared structural motifs. Researchers mining for novel enzymes or gene editors should query this atlas using feature-space proximity rather than sequence-based BLAST searches.
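
To make "clustering at 70% sequence identity" concrete, here is a minimal sketch of greedy representative-based clustering. It is an illustration only, not the pipeline Biohub used, which the summary does not describe. The function names, the toy sequences, and the position-by-position identity measure (which assumes equal-length, pre-aligned sequences) are all assumptions; real tools compute identity from alignments.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions.

    Toy assumption: sequences are pre-aligned and equal length.
    Unequal-length or empty pairs are treated as non-matching.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: list[str], threshold: float = 0.7) -> dict[str, list[str]]:
    """Assign each sequence to the first representative it matches at >= threshold.

    A sequence that matches no existing representative becomes a new one.
    """
    clusters: dict[str, list[str]] = {}
    # Process longest first so each representative is the longest member
    # of its cluster; ties keep their input order (Python's sort is stable).
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Hypothetical toy sequences: three near-identical variants and one outlier.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "GSHMLEDPVD"]
clusters = greedy_cluster(toy, threshold=0.7)
for rep, members in clusters.items():
    print(rep, len(members))
```

In this toy run the three variants collapse into one cluster and the outlier forms its own, so each cluster can be represented by a single sequence (and, in the atlas setting, a single predicted structure).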

What It Covers

Alex Rives, Head of Science at Biohub, presents ESM Cambrian (ESMC), a 6-billion-parameter protein language model trained on 6.8 billion non-redundant protein sequences. The model predicts protein structure, enables antibody design, and, with sparse autoencoders trained on its layers, reveals emergent biological features, all without multiple sequence alignments or hand-engineered priors.
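
For readers unfamiliar with sparse autoencoders (SAEs), the sketch below shows the forward pass and loss of a common generic formulation: a ReLU encoder, a linear decoder, and an L1 sparsity penalty on the features. Whether the ESMC analysis used this exact variant is not stated in the summary. The weights are hand-picked for illustration, and the input vector is a hypothetical stand-in for one residue's hidden activation at a single layer.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=1e-3):
    """Return (features, reconstruction, loss) for one activation vector.

    features = ReLU(W_enc (x - b_dec) + b_enc)
    recon    = W_dec features + b_dec
    loss     = mean squared error + l1_coeff * sum(|features|)
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [h + b for h, b in zip(matvec(w_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre]
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return features, recon, mse + l1_coeff * sparsity


# Hypothetical 2-dimensional activation and a 4-feature (overcomplete) dictionary.
w_enc = [[1, 0], [-1, 0], [0, 1], [0, -1]]   # 4 features x 2 dims
w_dec = [[1, -1, 0, 0], [0, 0, 1, -1]]       # 2 dims x 4 features
b_enc = [0.0, 0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

features, recon, loss = sae_forward([2.0, -3.0], w_enc, b_enc, w_dec, b_dec)
active = [i for i, f in enumerate(features) if f > 0]
print(active, recon, loss)
```

Only two of the four features fire for this input, yet the reconstruction is exact. Interpretability work then asks what each active feature corresponds to. In practice the weights are learned by minimizing this loss over many activations, one SAE per layer.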

Key Questions Answered

  • Virtual Biology Initiative targets cellular-scale data generation: Biohub commits $400M internally and $100M externally to generate cellular biology data at scale, prioritizing perturbation biology, spatial transcriptomics, and multi-modal single-cell measurements. Current cell atlases contain roughly one billion cells; the initiative targets multiple orders of magnitude beyond that. The core design principle mirrors protein modeling: expose the model to interventions across as many cellular contexts as possible to enable generalization to unobserved experiments.

Notable Moment

Rives notes that ESM2 appeared to hit diminishing returns on scaling — which could have ended the research direction entirely. The fix turned out to be data composition, not architecture. Adding metagenomic sequences restored a clean, predictable scaling law, validating the bitter lesson for protein biology years after the initial bet.
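
A "clean, predictable scaling law" usually means that a metric such as loss follows a power law in model size, which appears as a straight line on log-log axes. The sketch below fits such a line by least squares. The model sizes (300M, 600M, 6B) come from the summary, but the loss values are synthetic, generated from an assumed law purely to show the fitting procedure. They are not ESMC results, and the summary does not say which metric was used.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), slope


# Parameter counts from the summary; losses are SYNTHETIC, generated from
# an assumed law loss = 5.0 * N**-0.1 (illustration only).
params = [3e8, 6e8, 6e9]
losses = [5.0 * n ** -0.1 for n in params]

a, b = fit_power_law(params, losses)
print(f"fitted loss ~ {a:.3f} * N^{b:.3f}")
```

With real measurements, points that bend away from the fitted line at larger sizes are what "diminishing returns" looks like, while points that stay on the line indicate that scaling is still paying off.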
