The core philosophy behind ESMFold is that massive datasets and large transformer models can learn fundamental biological principles without built-in domain knowledge. This is a direct application of Rich Sutton's "The Bitter Lesson" to bioinformatics.
De-duplicating protein databases helps a model learn diverse structures, but subtle variations among similar sequences are essential for learning function, because a single mutation can be catastrophic. This justifies training models on massive, unclustered datasets to capture fine-grained functional determinants.
The ESM-C model's performance leap came from adding billions of "noisy" protein sequences from environmental samples. This vast, diverse dataset overcame the limitations of curated databases like UniRef, removing the data bottleneck and revealing clear scaling laws.
Trained only on sequence prediction, ESM-C independently developed a hierarchical feature space mirroring decades of human scientific discovery. Its learned representations range from basic biochemical properties to complex, abstract functional concepts, all without prior biological knowledge.
Future progress in biology requires moving beyond static models. The new paradigm involves an AI that reasons over hypotheses, prioritizes experiments, learns from the empirical outcomes, and updates its internal world model. This creates a scalable, closed-loop system for scientific discovery.
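The loop described above (prioritize an experiment, observe the outcome, update the model) can be sketched in a few lines. This is a toy illustration only: the "experiments" are simulated noisy measurements, the prioritization rule is a standard upper-confidence-bound heuristic chosen for simplicity, and the names run_closed_loop, true_effects, and noise are hypothetical rather than anything described in the source.

```python
import math
import random

def run_closed_loop(true_effects, rounds=200, noise=0.5, seed=0):
    """Toy closed loop: choose an experiment, observe a noisy outcome, update beliefs.

    true_effects is a hypothetical list of hidden effect sizes, one per candidate
    experiment. The loop never reads it directly except to simulate an outcome.
    """
    rng = random.Random(seed)
    n = len(true_effects)
    counts = [0] * n      # how many times each experiment has been run
    means = [0.0] * n     # current belief about each experiment's effect

    for t in range(1, rounds + 1):
        # Prioritize: prefer experiments that look promising or are still uncertain.
        def priority(i):
            if counts[i] == 0:
                return math.inf  # untested experiments come first
            return means[i] + math.sqrt(2 * math.log(t) / counts[i])

        choice = max(range(n), key=priority)

        # Run the (simulated) experiment and record the empirical outcome.
        outcome = true_effects[choice] + rng.gauss(0, noise)

        # Update the internal model with an incremental mean.
        counts[choice] += 1
        means[choice] += (outcome - means[choice]) / counts[choice]

    return means, counts

means, counts = run_closed_loop([0.1, 0.5, 2.0])
print("estimated effects:", [round(m, 2) for m in means])
print("experiments run per candidate:", counts)
```

A real system would replace the running means with a learned model and the simulated outcome with a laboratory measurement. Only the shape of the loop is meant to carry over.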
To create a predictive "virtual cell," data collection must shift from passive observation to active intervention. The strategy is to massively scale perturbation experiments (like Perturb-seq) across countless contexts and measure multi-modal responses, teaching the model cause and effect.
Today's "virtual cell" models represent their training data well but cannot predict outcomes for novel interventions. The next frontier is building models that generalize well enough to serve as true predictive oracles for experiments that have not yet been performed, a key focus for Biohub.
ESM-C is used as a predictive "world model" rather than a direct generator. Protein design, including for complex antibodies (scFvs, single-chain variable fragments), is framed as a search problem: find molecules within the model's learned space that satisfy desired criteria. This approach is achieving therapeutically relevant binding affinities.
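As a minimal sketch of "design as search", the example below runs a simple hill climb over sequences, using a scoring function as the oracle. The scorer toy_score is a made-up stand-in, not ESM-C or any real predictor, and hill climbing is just one easy-to-read search strategy. The source does not say which search method is actually used.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a learned predictor (NOT a real model).

    Rewards hydrophobic residues at even positions and charged residues at odd
    positions, returning the fraction of positions that match.
    """
    hydrophobic, charged = set("AILMFVW"), set("DEKR")
    hits = sum(
        (aa in hydrophobic) if i % 2 == 0 else (aa in charged)
        for i, aa in enumerate(seq)
    )
    return hits / len(seq)

def hill_climb(start: str, score, steps: int = 500, seed: int = 0):
    """Propose single-residue mutations and keep any that do not lower the score."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        cand_score = score(candidate)
        if cand_score >= best_score:
            best, best_score = candidate, cand_score
    return best, best_score

start = "G" * 12
designed, designed_score = hill_climb(start, toy_score)
print("start:   ", start, toy_score(start))
print("designed:", designed, designed_score)
```

The key design choice is that the model only scores candidates and the search loop does the generating. Swapping in a different predictor or a different search procedure leaves the overall structure unchanged.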
The success of protein language models can be explained by the distributional hypothesis that linguist Zellig Harris put forward in 1954. Just as a word's meaning is defined by the contexts it occurs in, an amino acid's biological role is determined by the sequence contexts in which it can appear. The model learns this deep statistical structure and, in doing so, effectively learns biology.
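The distributional idea can be made concrete with a toy example: hide one residue and estimate what belongs there from the residues around it. The sketch below uses plain co-occurrence counts over four invented sequences. A real protein language model learns far richer context with a neural network, so treat this only as an illustration of the principle. The names context_counts, predict_masked, and toy_seqs are hypothetical.

```python
from collections import Counter, defaultdict

def context_counts(seqs, window=1):
    """Count which residue appears between each (left, right) context pair."""
    table = defaultdict(Counter)
    for s in seqs:
        for i in range(window, len(s) - window):
            ctx = (s[i - window:i], s[i + 1:i + 1 + window])
            table[ctx][s[i]] += 1
    return table

def predict_masked(table, left, right):
    """Return a probability distribution over residues for a masked position."""
    counts = table.get((left, right))
    if not counts:
        return {}  # context never observed
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# Invented toy sequences, not real proteins.
toy_seqs = ["MKTAYIAK", "MKSAYIAR", "MKTAYLAK", "MRTAYIAK"]
table = context_counts(toy_seqs)

# Which residues occur between K and A, i.e. in the pattern "K?A"?
dist = predict_masked(table, "K", "A")
print(dist)
```

In this toy data the context "K?A" is filled by T twice and S once, so the estimate favors T while still allowing S. That is the same kind of signal, at a vastly larger scale, that lets a model infer which substitutions a position tolerates.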
