🔬ESMFold2: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

Latent Space: The AI Engineer Podcast · May 27, 2026

Alex Rives of BioHub explains ESM-C, a world model for protein biology. Scaling laws and metagenomic data unlock emergent protein design capabilities.

Noisy Metagenomic Data, Not Curated Sequences, Unlocked Protein Model Scaling

The ESM-C model's performance leap came from adding billions of "noisy" protein sequences from environmental samples. This vast, diverse dataset overcame the limitations of curated databases like UniRef, removing the data bottleneck and revealing clear scaling laws.

Protein Models Learn Function by Applying a 1954 Linguistic Theory to Amino Acids

The success of protein language models can be explained by Zellig Harris's 1954 linguistic theory. Just as a word's meaning is defined by its contexts, an amino acid's biological role is determined by the sequences it can appear in. The model learns this deep statistical structure, effectively learning biology.
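
The distributional idea can be illustrated with a minimal sketch. This is not how ESM-C is trained; it is a toy neighbor-counting example, and the sequences, `context_vector`, and `cosine` are all invented here for illustration. It only shows that residues appearing in the same contexts end up with similar context vectors.

```python
from collections import Counter
import math

# Toy sequences invented for illustration; not real protein data.
SEQS = ["GAVLK", "GAILK", "GAVLR", "GAILR", "DAVLK"]
ALPHABET = sorted(set("".join(SEQS)))


def context_vector(residue, seqs, alphabet):
    """Count the residues found immediately left and right of `residue`."""
    counts = Counter()
    for s in seqs:
        for i, aa in enumerate(s):
            if aa != residue:
                continue
            if i > 0:
                counts[("left", s[i - 1])] += 1
            if i < len(s) - 1:
                counts[("right", s[i + 1])] += 1
    keys = [(side, a) for side in ("left", "right") for a in alphabet]
    return [counts[k] for k in keys]


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vec_v = context_vector("V", SEQS, ALPHABET)
vec_i = context_vector("I", SEQS, ALPHABET)
vec_g = context_vector("G", SEQS, ALPHABET)

# V and I occupy the same slot in these toy sequences, G does not.
print("V vs I:", cosine(vec_v, vec_i))
print("V vs G:", cosine(vec_v, vec_g))
```

In this toy data, V and I always sit between A and L, so their context vectors point in the same direction, while G never shares a context with V. A language model trained on real sequences learns a far richer version of the same signal.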

Building Predictive Cell Models Requires Scaling Interventional, Not Just Observational, Data

To create a predictive "virtual cell," data collection must shift from passive observation to active intervention. The strategy is to massively scale perturbation experiments (like Perturb-seq) across countless contexts and measure multi-modal responses, teaching the model cause and effect.

Protein Language Models Spontaneously Learn Biology's Textbook Hierarchy

Trained only on sequence prediction, ESM-C independently developed a hierarchical feature space mirroring decades of human scientific discovery. Its learned representations range from basic biochemical properties to complex, abstract functional concepts, all without prior biological knowledge.

BioHub Designs Therapeutic Antibodies by Searching, Not Generating, Its Protein World Model

ESM-C is used as a predictive "world model" rather than a direct generator. Protein design, including for complex antibody formats such as single-chain variable fragments (scFvs), is framed as a search problem: find molecules within the model's learned space that satisfy desired criteria. This approach is achieving therapeutically relevant binding affinities.
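
Framing design as search can be sketched with a simple hill climb. This is an illustration under loud assumptions, not BioHub's method: `toy_score` is a hypothetical stand-in for whatever objective a predictive model would supply (for example, predicted binding), and the target motif inside it is arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical objective: fraction of positions matching an arbitrary motif.

    In a real pipeline this would be replaced by a model-derived prediction.
    """
    target = "MKTAYIAK"  # arbitrary, for illustration only
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def hill_climb(start, score_fn, steps=500, seed=0):
    """Propose single-residue substitutions and keep those that improve the score."""
    rng = random.Random(seed)
    best, best_score = start, score_fn(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


start = "AAAAAAAA"
designed, designed_score = hill_climb(start, toy_score)
print(start, toy_score(start))
print(designed, designed_score)
```

The point of the sketch is the division of labor: the scoring function predicts, and a separate search procedure proposes candidates and keeps the ones that satisfy the criteria. Any search strategy could be swapped in for the hill climb.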

The Next Scientific Paradigm in Biology is an AI-Driven Experimental Feedback Loop

Future progress in biology requires moving beyond static models. The new paradigm involves an AI that reasons over hypotheses, prioritizes experiments, learns from the empirical outcomes, and updates its internal world model. This creates a scalable, closed-loop system for scientific discovery.

Current Virtual Cell Models Fail at Prediction; True Oracles Must Generalize to Novel Interventions

Today's "virtual cell" models represent training data well but cannot predict outcomes for novel interventions. The next frontier is building models that generalize to serve as true predictive oracles for experiments that haven't yet been performed, a key focus for BioHub.

BioHub's Alex Rives Bets on Scaling Laws, Not Human Priors, to Model Proteins

The core philosophy behind ESMFold is that massive datasets and large transformer models can learn fundamental biological principles without needing built-in domain knowledge, applying Rich Sutton's "The Bitter Lesson" directly to bioinformatics.

Seemingly Redundant Protein Data Is Key to Learning Function, Not Just Structure

While de-duplicating protein databases helps learn diverse structures, subtle variations in similar sequences are essential for learning function, as a single mutation can be catastrophic. This justifies training models on massive, unclustered datasets to capture fine-grained functional determinants.
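
A small sketch of why de-duplication can discard functional signal. It assumes equal-length, pre-aligned toy sequences and a greedy identity threshold; `identity`, `greedy_cluster`, and the sequences are invented for illustration, and real pipelines use dedicated clustering tools rather than this loop.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Keep a sequence as a representative only if its identity to every
    representative kept so far is below `threshold`."""
    representatives = []
    for s in seqs:
        if all(identity(s, r) < threshold for r in representatives):
            representatives.append(s)
    return representatives


# Toy sequences invented for illustration.
wild_type = "MKTAYIAKQR"
point_mutant = "MKTAYIAKQD"  # differs from wild_type at one position
unrelated = "GSWLPEHNCV"

kept = greedy_cluster([wild_type, point_mutant, unrelated], threshold=0.8)
print(kept)
```

With a threshold of 0.8, the point mutant is folded into the wild-type cluster and dropped, so a model trained on the representatives never sees that single substitution. Training on the unclustered set keeps such near-duplicates available as evidence about which positions tolerate change.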
