
Alex Rives




AI Summary

→ WHAT IT COVERS

Alex Rives, Head of Science at Biohub, presents ESM Cambrian (ESMC), a 6-billion-parameter protein language model trained on 6.8 billion non-redundant protein sequences. The model predicts protein structure, enables antibody design, and can be probed with sparse autoencoders to reveal emergent biological features, all without multiple sequence alignments (MSAs) or hand-engineered priors.

→ KEY INSIGHTS

- **Metagenomic data eliminates scaling bottleneck:** ESM2 showed diminishing returns because it trained only on UniRef sequences. Adding metagenomic data (sequences collected from hydrothermal vents, deep oceans, soil, and gut environments) restored clean scaling laws for ESMC. The lesson: the biological diversity of the training data matters more than parameter count alone. Researchers building bio-foundation models should prioritize sourcing sequences from extreme and underrepresented ecological niches before scaling compute.
- **Sparse autoencoders reveal emergent biological hierarchy:** Training sparse autoencoders (SAEs) across all layers of ESMC's 300M, 600M, and 6B parameter models reveals a feature hierarchy matching decades of experimental biology, from basic biochemical properties up to abstract functional themes, without any prior biological knowledge encoded. Teams doing mechanistic interpretability on biology models should apply SAEs layer by layer to surface the latent biological variables the model uses for sequence prediction.
- **Antibody design without MSAs reaches therapeutic affinity:** ESMC designs single-chain antibody fragments (scFvs) that reach the binding affinity required for therapeutic function, without using multiple sequence alignments. Antibodies evolve toward diversity rather than conservation, which puts MSA-based approaches at a structural disadvantage. Protein engineers targeting antibody-based therapeutic modalities, which represent roughly 25% of new drugs, should evaluate world-model search approaches over MSA-dependent pipelines for antibody CDR design.
- **World-model search replaces explicit programming for protein design:** Rather than encoding biological rules or structural priors, ESMC treats protein design as a search problem over a predictive world model. Mini-protein binders and scFvs emerge from searching the model's learned representation space against design criteria. Computational biology teams can operationalize this by using ESMC's MIT-licensed weights to run generative searches rather than building a task-specific supervised model for each design objective.
- **Atlas of 1.1 billion predicted structures enables cross-evolution linkage:** Biohub clustered 6.8 billion sequences at 70% sequence identity, producing ~1.2 billion clusters with predicted structures. Computing features across all clusters surfaces connections between distantly related proteins, such as gene editing systems with no sequence similarity but shared structural motifs. Researchers mining for novel enzymes or gene editors should query this atlas by feature-space proximity rather than with sequence-based BLAST searches.
- **Virtual Biology Initiative targets cellular-scale data generation:** Biohub commits $400M internally and $100M externally to generate cellular biology data at scale, prioritizing perturbation biology, spatial transcriptomics, and multi-modal single-cell measurements. Current cell atlases contain roughly one billion cells; the initiative targets multiple orders of magnitude beyond that. The core design principle mirrors protein modeling: expose the model to interventions across as many cellular contexts as possible so that it can generalize to unobserved experiments.

→ NOTABLE MOMENT

Rives notes that ESM2 appeared to hit diminishing returns on scaling, which could have ended the research direction entirely. The fix turned out to be data composition, not architecture. Adding metagenomic sequences restored a clean, predictable scaling law, validating the bitter lesson for protein biology years after the initial bet.

💼 SPONSORS

None detected

🏷️ Protein Language Models, Computational Biology, Mechanistic Interpretability, Antibody Design, Metagenomic Sequencing, Virtual Cell
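
The atlas insight in the summary recommends searching by feature-space proximity rather than by sequence similarity. A minimal sketch of that idea follows, assuming each cluster is represented by a fixed-length feature vector. The cluster ids, the vectors, the function names, and the choice of cosine similarity are all illustrative assumptions; the episode summary does not describe the atlas's actual query interface or distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_clusters(query, atlas, k=2):
    """Return the ids of the k clusters whose features best match the query."""
    scored = [(cosine_similarity(query, vec), cid) for cid, vec in atlas.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]


# Hypothetical toy atlas: cluster id -> made-up feature vector.
toy_atlas = {
    "cluster_a": [0.9, 0.1, 0.0],
    "cluster_b": [0.8, 0.2, 0.1],
    "cluster_c": [0.0, 0.1, 0.9],
}

query_features = [1.0, 0.0, 0.0]
top_hits = nearest_clusters(query_features, toy_atlas, k=2)
```

Two proteins with no detectable sequence similarity can still rank as neighbors here, because the comparison uses only the feature vectors. At the scale of a billion clusters, an approximate nearest-neighbor index would likely replace this exhaustive scan.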
