Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

While de-duplicating protein databases helps learn diverse structures, subtle variations in similar sequences are essential for learning function, as a single mutation can be catastrophic. This justifies training models on massive, unclustered datasets to capture fine-grained functional determinants.

Related Insights

To evolve AI from pattern matching to understanding physics for protein engineering, structural data is insufficient. Models need physical parameters like Gibbs free energy (delta-G), obtainable from affinity measurements, to become truly predictive and transformative for therapeutic development.

The core philosophy behind ESMFold is that massive datasets and large transformer models can learn fundamental biological principles without needing built-in domain knowledge, applying Rich Sutton's "The Bitter Lesson" directly to bioinformatics.

The ESM-C model's performance leap came from adding billions of "noisy" protein sequences from environmental samples. This vast, diverse dataset overcame the limitations of curated databases like Uniref, removing the data bottleneck and revealing clear scaling laws.

To break the data bottleneck in AI protein engineering, companies now generate massive synthetic datasets. By creating novel "synthetic epitopes" and measuring their binding, they can produce thousands of validated positive and negative training examples in a single experiment, massively accelerating model development.

Models like AlphaFold don't solve protein folding from physics alone. They heavily rely on co-evolutionary data, where correlated mutations across species provide strong hints about which amino acids are physically close. This dramatically constrains the search space for the final structure.

The success of protein language models can be explained by Zellig Harris's 1954 linguistic theory. Just as a word's meaning is defined by its contexts, an amino acid's biological role is determined by the sequences it can appear in. The model learns this deep statistical structure, effectively learning biology.

Contrary to trends in other AI fields, structural biology problems are not yet dominated by simple, scaled-up transformers. Specialized architectures that bake in physical priors, like equivariance, still yield vastly superior performance, as the domain's complexity requires strong inductive biases.

Current AI for protein engineering relies on small public datasets like the PDB (~10,000 structures), causing models to "hallucinate" or default to known examples. This data bottleneck, orders of magnitude smaller than data used for LLMs, hinders the development of novel therapeutics.

Trained only on sequence prediction, ESM-C independently developed a hierarchical feature space mirroring decades of human scientific discovery. Its learned representations range from basic biochemical properties to complex, abstract functional concepts, all without prior biological knowledge.

While petabytes of observational DNA sequence data exist, it's insufficient for the next wave of AI. The key to creating powerful, functional models is generating causal data—from experiments that systematically test function—which is a current data bottleneck.

Seemingly Redundant Protein Data Is Key to Learning Function, Not Just Structure | RiffOn