The core philosophy behind ESMFold is that massive datasets and large transformer models can learn fundamental biological principles without needing built-in domain knowledge, applying Rich Sutton's "The Bitter Lesson" directly to bioinformatics.

Related Insights

A key strategy for improving results from generative protein models is "inference-time scaling." This involves generating a vast number of potential structures and then using a separate, fine-tuned scoring model to rank them. This search-and-rank process uncovers high-quality solutions the model might otherwise miss.
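The search-and-rank loop described above can be sketched in a few lines. This is a minimal illustration only: `generate_candidate` and `score_candidate` are hypothetical stand-ins for a generative model and a fine-tuned scoring model, and the toy score (fraction of hydrophobic residues) is an assumption chosen so the example runs without any external dependencies.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AILMFVW")


def generate_candidate(rng, length=12):
    """Hypothetical stand-in for drawing one sample from a generative model."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def score_candidate(seq):
    """Hypothetical stand-in for a separate scoring model.

    Toy score: fraction of hydrophobic residues in the sequence.
    """
    return sum(aa in HYDROPHOBIC for aa in seq) / len(seq)


def best_of_n(n_samples, top_k, seed=0):
    """Generate n_samples candidates, rank them by score, keep the top_k."""
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n_samples)]
    ranked = sorted(candidates, key=score_candidate, reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    for seq in best_of_n(n_samples=1000, top_k=3):
        print(seq, round(score_candidate(seq), 2))
```

Because the candidates are drawn from the same seeded stream, raising `n_samples` can only keep or improve the best score found, which is the sense in which extra inference-time compute uncovers solutions a single sample would miss.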

D. E. Shaw Research (DESRES) invested heavily in custom silicon for molecular dynamics (MD) to solve protein folding. In contrast, DeepMind's AlphaFold, using ML on experimental data, solved it on commodity hardware. This demonstrates that data-driven approaches can be vastly more effective than brute-force simulation for complex scientific problems.

The ESM-C model's performance leap came from adding billions of "noisy" protein sequences from environmental samples. This vast, diverse dataset overcame the limitations of curated databases like UniRef, removing the data bottleneck and revealing clear scaling laws.

Unlike classic theories based on simple equations, large AI models represent a new kind of scientific object. Rather than being mere predictive tools, they could be a novel form of explanation that we must learn to manipulate through new operations like distillation and merging, much like Mathematica made massive equations workable.

An anecdote about a "wonky" BindCraft design with disconnected beta sheets, which experts predicted would fail, highlights a key trend. The resulting binder was one of the best ever produced, suggesting AI models are extracting structural principles that go beyond traditional human "protein literacy" and intuition.

The success of protein language models can be explained by Zellig Harris's 1954 linguistic theory. Just as a word's meaning is defined by its contexts, an amino acid's biological role is determined by the sequences it can appear in. The model learns this deep statistical structure, effectively learning biology.
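The distributional idea can be made concrete with a toy example: record the neighboring residues each amino acid appears between, then compare two residues by the contexts they share. The sequences below are invented strings for illustration, not real proteins, and the function names are hypothetical. A language model learns a far richer version of the same statistics.

```python
from collections import Counter, defaultdict


def context_profiles(sequences):
    """Map each residue to a count of (left neighbor, right neighbor) contexts."""
    profiles = defaultdict(Counter)
    for seq in sequences:
        padded = "^" + seq + "$"  # mark sequence start and end
        for i in range(1, len(padded) - 1):
            profiles[padded[i]][(padded[i - 1], padded[i + 1])] += 1
    return profiles


def shared_contexts(profiles, a, b):
    """Contexts in which both residue a and residue b have been observed."""
    return set(profiles[a]) & set(profiles[b])


# Invented toy sequences: A and L occur in the same positions.
toy_sequences = ["GAVG", "GLVG", "GAKG", "GLKG"]
profiles = context_profiles(toy_sequences)

if __name__ == "__main__":
    print(shared_contexts(profiles, "A", "L"))
```

In this toy set, A and L are interchangeable in every context they appear in, which is the kind of signal from which a model could infer that two residues play similar roles.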

Contrary to trends in other AI fields, structural biology problems are not yet dominated by simple, scaled-up transformers. Specialized architectures that bake in physical priors, like equivariance, still yield vastly superior performance, as the domain's complexity requires strong inductive biases.

Trained only on sequence prediction, ESM-C independently developed a hierarchical feature space mirroring decades of human scientific discovery. Its learned representations range from basic biochemical properties to complex, abstract functional concepts, all without prior biological knowledge.

ESM-C is used as a predictive "world model" rather than a direct generator. Protein design, including for complex antibodies (scFvs), is framed as a search problem: find molecules within the model's learned space that satisfy desired criteria. This approach is achieving therapeutically relevant binding affinities.
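Framing design as search means the model only has to score candidates, while a separate procedure proposes them. Below is a minimal sketch using simple hill climbing over point mutations. `predicted_fitness` is a hypothetical stand-in for the model's prediction (here a toy match count against an arbitrary pattern), and the search strategy is an assumption for illustration, not the method described in the episode.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOY_PATTERN = "MKTAYIAKQR"  # arbitrary pattern, used only by the toy objective


def predicted_fitness(seq):
    """Hypothetical stand-in for a predictive model's score of a sequence."""
    return sum(a == b for a, b in zip(seq, TOY_PATTERN))


def mutate(seq, rng):
    """Propose a single random point mutation."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def search(start, steps, seed=0):
    """Hill climbing: keep a mutation only if the predicted score does not drop."""
    rng = random.Random(seed)
    best, best_score = start, predicted_fitness(start)
    for _ in range(steps):
        candidate = mutate(best, rng)
        score = predicted_fitness(candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    print(search("A" * 10, steps=2000))
```

The predictive model never generates anything here. It acts purely as the objective, so swapping in a different criterion means changing only `predicted_fitness`.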

Generate Biomedicines' AI learns the fundamental rules of protein structure and function, much like a language's grammar. This allows it to design entirely new proteins by generating novel "sentences" (sequences) that are biologically coherent and functional, rather than just mimicking existing ones found in nature.