Alex Rives

🔬ESMFold2: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

May 27, 2026 | 70 min | Head of Science at BioHub

AI Summary

→ WHAT IT COVERS
Alex Rives, Head of Science at Biohub, presents ESM Cambrian (ESMC), a 6-billion-parameter protein language model trained on 6.8 billion non-redundant protein sequences. The model predicts protein structure, enables antibody design, and, when probed with sparse autoencoders (SAEs), reveals emergent biological features, all without multiple sequence alignments (MSAs) or hand-engineered priors.

→ KEY INSIGHTS
- **Metagenomic data eliminates scaling bottleneck:** ESM2 showed diminishing returns because it trained only on UniRef sequences. Adding metagenomic data (sequences collected from hydrothermal vents, deep oceans, soil, and gut environments) restored clean scaling laws for ESMC. The lesson: biological diversity of training data matters more than parameter count alone. Researchers building bio-foundation models should prioritize sourcing sequences from extreme and underrepresented ecological niches before scaling compute.
- **Sparse autoencoders reveal emergent biological hierarchy:** Training sparse autoencoders across all layers of ESMC's 300M, 600M, and 6B parameter models reveals a feature hierarchy matching decades of experimental biology, from basic biochemical properties up to abstract functional themes, without any prior biological knowledge encoded. Teams doing mechanistic interpretability on biology models should apply SAEs layer-by-layer to surface latent biological variables the model uses for sequence prediction.
- **Antibody design without MSAs reaches therapeutic affinity:** ESMC designs single-chain variable fragment antibodies (scFvs) that reach the binding affinity required for therapeutic function, without using multiple sequence alignments. Antibodies evolve toward diversity rather than conservation, making MSA-based approaches structurally disadvantaged. Protein engineers targeting antibody therapeutics, which represent roughly 25% of new drugs, should evaluate world-model search approaches over MSA-dependent pipelines for antibody CDR design.
- **World-model search replaces explicit programming for protein design:** Rather than encoding biological rules or structural priors, ESMC treats protein design as a search problem over a predictive world model. Mini-protein binders and scFvs emerge from searching the model's learned representation space against design criteria. Computational biology teams can operationalize this by using ESMC's MIT-licensed weights to run generative searches rather than building task-specific supervised models for each design objective.
- **Atlas of 1.1 billion predicted structures enables cross-evolution linkage:** Biohub clustered 6.8 billion sequences at 70% sequence identity, producing ~1.2 billion clusters with predicted structures. Computing features across all clusters surfaces connections between distantly related proteins, such as gene editing systems with no sequence similarity but shared structural motifs. Researchers mining for novel enzymes or gene editors should query this atlas using feature-space proximity rather than sequence-based BLAST searches.
- **Virtual Biology Initiative targets cellular-scale data generation:** Biohub commits $400M internally and $100M externally to generate cellular biology data at scale, prioritizing perturbation biology, spatial transcriptomics, and multi-modal single-cell measurements. Current cell atlases contain roughly one billion cells; the initiative targets multiple orders of magnitude beyond that. The core design principle mirrors protein modeling: expose the model to interventions across as many cellular contexts as possible to enable generalization to unobserved experiments.

→ NOTABLE MOMENT
Rives notes that ESM2 appeared to hit diminishing returns on scaling, which could have ended the research direction entirely. The fix turned out to be data composition, not architecture. Adding metagenomic sequences restored a clean, predictable scaling law, validating the bitter lesson for protein biology years after the initial bet.
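To make the idea of a "clean, predictable scaling law" concrete, here is a minimal sketch of how such a law is commonly checked: fit a power law, loss ≈ a · compute^(−b), as a straight line in log-log space. The function name `fit_power_law` and every number below are illustrative assumptions; the data points are generated from a known power law and are not ESM2 or ESMC measurements.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) by linear least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic points generated from a known power law (NOT real model results).
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 50.0 * compute ** -0.05

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.4f}")
```

On real training runs, a curve that flattens faster than the fitted line would be the kind of diminishing return the summary attributes to ESM2.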
💼 SPONSORS None detected

🏷️ Protein Language Models, Computational Biology, Mechanistic Interpretability, Antibody Design, Metagenomic Sequencing, Virtual Cell
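For readers unfamiliar with the sparse autoencoders mentioned in the summary, the following is a generic sketch of the standard formulation (a ReLU encoder, a linear decoder, and an L1 sparsity penalty). It is not Biohub's implementation: the layer sizes, the random stand-in activations, and the names `sae_forward` and `sae_loss` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: embedding width and SAE dictionary size are illustrative only.
d_model, n_features, batch = 64, 256, 8

# Randomly initialised parameters; a real SAE would learn these by gradient descent.
W_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder
    reconstruction = features @ W_dec + b_dec                # linear decoder
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    recon_err = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon_err + l1_coeff * sparsity

# Stand-in for per-residue activations from one model layer (random, not real data).
x = rng.normal(size=(batch, d_model))
features, reconstruction = sae_forward(x)
loss = sae_loss(x, features, reconstruction)
print(features.shape, reconstruction.shape, float(loss))
```

Applied layer by layer, each learned feature direction can then be inspected for the residues or proteins that activate it most strongly.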


Featured On 1 Podcast

Latent Space

Top resources Alex Rives mentions

ESM Cambrian (ESMC)

ESM2

