The success of protein language models can be explained by Zellig Harris's 1954 distributional hypothesis. Just as a word's meaning is defined by the contexts it appears in, an amino acid's biological role is determined by the sequences it can appear in. By learning this deep statistical structure, the model effectively learns biology.
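As a rough illustration of the distributional idea, the sketch below builds a context-count vector for each residue from its immediate neighbors and compares residues by cosine similarity. It is a count-based toy, not a transformer: the sequences are made up, and the names `context_vectors`, `cosine`, and `TOY_SEQUENCES` are hypothetical choices for this example only.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sequences, window=1):
    """Count, for every residue, which neighbors occur at which relative offset."""
    vectors = defaultdict(Counter)
    for seq in sequences:
        for pos, residue in enumerate(seq):
            lo = max(0, pos - window)
            hi = min(len(seq), pos + window + 1)
            for k in range(lo, hi):
                if k != pos:
                    # Key = (relative offset, neighboring residue)
                    vectors[residue][(k - pos, seq[k])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Made-up toy data: I and L occur in identical contexts, D in a different one.
TOY_SEQUENCES = ["GAIAG", "GALAG", "KSDSK"]

vectors = context_vectors(TOY_SEQUENCES)
similarity_i_l = cosine(vectors["I"], vectors["L"])  # shared contexts
similarity_i_d = cosine(vectors["I"], vectors["D"])  # no shared contexts
```

In this toy data, isoleucine and leucine end up with matching context vectors while aspartate shares none of their contexts, which is the same signal, at a vastly smaller scale, that a model trained on sequence prediction can exploit.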
While de-duplicating protein databases helps models learn diverse structures, subtle variations among similar sequences are essential for learning function, since a single mutation can be catastrophic. This justifies training models on massive, unclustered datasets to capture fine-grained functional determinants.
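To make the trade-off concrete, here is a minimal sketch of greedy clustering at a sequence-identity threshold. It assumes equal-length sequences and position-wise identity, which is a simplification; real clustering tools compute identity from alignments. The sequences and the names `identity`, `greedy_cluster`, and `TOY_SEQUENCES` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first representative it matches at >= threshold."""
    representatives = []
    clusters = []
    for seq in sequences:
        for idx, rep in enumerate(representatives):
            if identity(seq, rep) >= threshold:
                clusters[idx].append(seq)
                break
        else:
            # No representative was close enough: start a new cluster.
            representatives.append(seq)
            clusters.append([seq])
    return clusters


# Made-up toy data: the second sequence is a single-residue variant of the first.
TOY_SEQUENCES = ["MKTAYIAKQR", "MKTAYIAKQL", "GSHMLEDPVN"]

clustered_90 = greedy_cluster(TOY_SEQUENCES, threshold=0.9)
clustered_100 = greedy_cluster(TOY_SEQUENCES, threshold=1.0)
```

At a 90% threshold the point mutant falls into the same cluster as its parent, so a dataset that keeps one representative per cluster would never show the model that variant. Keeping every sequence preserves it.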
Instead of building from scratch, ProPhet leverages existing transformer models to create unique mathematical 'languages' for proteins and molecules. Its core innovation is an additional model that translates between the two, creating a unified space to predict interactions at scale.
The next major AI breakthrough will come from applying generative models to complex systems beyond human language, such as biology. By treating biological processes as a unique "language," AI could discover novel therapeutics or research paths, leading to a "Move 37" moment in science.
The core philosophy behind ESMFold is that massive datasets and large transformer models can learn fundamental biological principles without needing built-in domain knowledge, applying Rich Sutton's "The Bitter Lesson" directly to bioinformatics.
Models like AlphaFold don't solve protein folding from physics alone. They heavily rely on co-evolutionary data, where correlated mutations across species provide strong hints about which amino acids are physically close. This dramatically constrains the search space for the final structure.
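One classical way to surface such correlated mutations is mutual information between columns of a multiple sequence alignment. The sketch below is only that raw measure on a made-up alignment; it is not how AlphaFold processes its inputs, and it omits the corrections that practical covariation methods apply. The names `column_mutual_information`, `top_coupled_pairs`, and `TOY_MSA` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    i_counts = Counter(seq[i] for seq in msa)
    j_counts = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((i_counts[a] / n) * (j_counts[b] / n)))
    return mi


def top_coupled_pairs(msa, k=3):
    """Return the k column pairs with the highest mutual information."""
    length = len(msa[0])
    scores = {
        (i, j): column_mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up toy alignment: columns 1 and 3 co-vary (K pairs with E, R with D).
TOY_MSA = [
    "AKGEL",
    "AKGEL",
    "ARGDL",
    "ARGDV",
    "AKGEV",
    "ARGDL",
]

best_pair = top_coupled_pairs(TOY_MSA, k=1)[0]
best_score = column_mutual_information(TOY_MSA, *best_pair)
```

Columns that mutate together score high, hinting that the two positions may be in physical contact, while fully conserved columns carry no covariation signal at all.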
AI is moving beyond simply identifying patterns in existing research papers. It can now extrapolate fundamental biological principles, enabling it to understand complex systems from the ground up, such as the relationships among atoms, molecules, and proteins.
Demis Hassabis argues that machine learning is the ideal framework for understanding biological systems. Unlike physics, which is elegantly described by mathematics, biology's messy, data-rich nature with many weak correlations is perfectly suited for ML to model and decipher.
Trained only on sequence prediction, ESM-C independently developed a hierarchical feature space mirroring decades of human scientific discovery. Its learned representations range from basic biochemical properties to complex, abstract functional concepts, all without prior biological knowledge.
Generate Biomedicines' AI learns the fundamental rules of protein structure and function, much like a language's grammar. This allows it to design entirely new proteins by generating novel "sentences" (sequences) that are biologically coherent and functional, rather than just mimicking existing ones found in nature.
Traditional science failed to create equations for complex biological systems because biology is too "bespoke." AI succeeds by discerning patterns from vast datasets, effectively serving as the "language" for modeling biology, much as mathematics is the language of physics.