To evolve AI for protein engineering from pattern matching to a genuine understanding of physics, structural data alone is insufficient. Models also need energetic parameters such as the Gibbs free energy of binding (ΔG), obtainable from affinity measurements, to become truly predictive and transformative for therapeutic development.
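
As a concrete illustration of how affinity measurements translate into energetic parameters, the minimal sketch below converts a measured dissociation constant (Kd) into a standard binding free energy via ΔG° = RT ln(Kd). The Kd values shown are hypothetical examples, not data from the source.

```python
import math

R = 1.987e-3  # gas constant, kcal/(mol*K)
T = 298.15    # standard temperature, K

def delta_g_from_kd(kd_molar: float) -> float:
    """Standard binding free energy (kcal/mol) from a dissociation constant in M.
    More negative values mean tighter binding."""
    return R * T * math.log(kd_molar)

# Hypothetical affinities spanning weak (micromolar) to very tight (picomolar) binders.
for kd in (1e-6, 1e-9, 1e-12):
    print(f"Kd = {kd:.0e} M  ->  dG = {delta_g_from_kd(kd):+.1f} kcal/mol")
```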

Related Insights

AI modeling transforms drug development from a numbers game of screening millions of compounds to an engineering discipline. Researchers can model molecular systems upfront, understand key parameters, and design solutions for a specific problem, turning a costly screening process into a rapid, targeted design cycle.

Simple cell viability screens fail to identify powerful drug combinations where each component is ineffective on its own. AI can predict these synergies, but only if trained on mechanistic data that reveals how cells rewire their internal pathways in response to a drug.
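
One standard way to quantify the synergy described here is the Bliss independence model: if each drug alone inhibits a fraction of cells, the expected combined inhibition is E_A + E_B - E_A*E_B, and any excess over that expectation counts as synergy. The sketch below uses made-up inhibition values purely for illustration; it is not a substitute for the mechanistic training data the insight calls for.

```python
def bliss_excess(e_a: float, e_b: float, e_ab: float) -> float:
    """Bliss excess: observed combined inhibition minus the inhibition expected
    if the two drugs acted independently. Inputs are fractions in [0, 1];
    a positive result indicates synergy."""
    expected = e_a + e_b - e_a * e_b
    return e_ab - expected

# Hypothetical example: each drug alone barely affects viability (5% inhibition),
# but together they inhibit 60% of cells -- a combination a single-agent
# viability screen would never flag.
print(bliss_excess(0.05, 0.05, 0.60))  # ~0.50 -> strong synergy signal
```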

Generating the volume of protein affinity data produced by a single multi-week A-AlphaBio experiment with standard methods like surface plasmon resonance (SPR) would cost an economically infeasible $100-$500 million. This staggering cost difference illustrates the fundamental barrier that new high-throughput platforms are designed to overcome.
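
To make the scale of that gap concrete, here is a back-of-envelope sketch. The number of measurements per experiment and the per-measurement SPR cost are assumed illustrative figures, not values stated in the source; they are chosen only to show how conventional one-at-a-time pricing reaches hundreds of millions of dollars.

```python
# Back-of-envelope (assumed figures, for illustration only): a high-throughput
# experiment yielding on the order of a million pairwise affinity measurements,
# priced at conventional one-at-a-time SPR rates.
measurements_per_experiment = 1_000_000   # assumption
spr_cost_per_measurement = (100, 500)     # assumed $ range per SPR measurement

low, high = (c * measurements_per_experiment for c in spr_cost_per_measurement)
print(f"Equivalent SPR cost: ${low/1e6:.0f}M - ${high/1e6:.0f}M")
# -> Equivalent SPR cost: $100M - $500M
```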

To break the data bottleneck in AI protein engineering, companies now generate massive synthetic datasets. By creating novel "synthetic epitopes" and measuring their binding, they can produce thousands of validated positive and negative training examples in a single experiment, dramatically accelerating model development.
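
A minimal sketch, under assumptions, of how such an experiment's output could be turned into training labels: each measured pair gets a binary label by thresholding its affinity, yielding validated positives and negatives in one pass. The field names and the 1 µM cutoff are hypothetical choices for illustration, not details of any specific platform.

```python
from dataclasses import dataclass

@dataclass
class BindingMeasurement:
    epitope_id: str   # hypothetical identifier for a synthetic epitope
    binder_id: str    # hypothetical identifier for the candidate binder
    kd_molar: float   # measured dissociation constant, in M

def label_examples(measurements, kd_cutoff=1e-6):
    """Turn raw affinity measurements into (features, label) training pairs:
    anything tighter than the cutoff is a positive, the rest are negatives."""
    return [
        ((m.epitope_id, m.binder_id), int(m.kd_molar < kd_cutoff))
        for m in measurements
    ]

# Hypothetical measurements: one tight binder, one non-binder.
data = [
    BindingMeasurement("epi_001", "ab_017", 3e-9),
    BindingMeasurement("epi_001", "ab_042", 5e-4),
]
print(label_examples(data))  # [(('epi_001', 'ab_017'), 1), (('epi_001', 'ab_042'), 0)]
```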

While AI promises to design therapeutics computationally, it doesn't eliminate the need for physical lab work. Even if future models require no training data, their predicted outputs must be experimentally validated. This ensures a continuous, inescapable cycle where high-throughput data generation remains critical for progress.

To make genuine scientific breakthroughs, an AI needs to learn the abstract reasoning strategies and mental models of expert scientists. This involves teaching it higher-level concepts, such as thinking in terms of symmetries, a core principle in physics that current models lack.
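
One concrete instance of "thinking in symmetries" is requiring that a model's predicted binding energy not change when a protein complex is rigidly rotated or translated. The sketch below tests that invariance for a toy scoring function; the scoring function is a stand-in for illustration, not any model described in the source.

```python
import numpy as np

def toy_energy(coords: np.ndarray) -> float:
    """Stand-in scoring function that depends only on pairwise distances,
    so it is invariant to rigid rotations and translations by construction."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return float(np.sum(np.triu(dists, k=1)))

def random_rotation(rng: np.random.Generator) -> np.ndarray:
    """Random 3x3 proper rotation matrix (orthogonal, determinant +1)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3))                                # toy 10-atom "structure"
rotated = coords @ random_rotation(rng).T + rng.normal(size=3)   # rigid rotation + translation

# A physics-aware model should pass this check; a pure pattern-matcher may not.
assert np.isclose(toy_energy(coords), toy_energy(rotated))
```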

The progress of AI in predicting cancer treatment is stalled not by algorithms, but by the data used to train them. Relying solely on static genetic data is insufficient. The critical missing piece is functional, contextual data showing how patient cells actually respond to drugs.

Current AI for protein engineering relies on small public datasets such as the PDB, which holds on the order of 200,000 experimentally determined structures, causing models to "hallucinate" or default to known examples. That corpus is orders of magnitude smaller than the data used to train LLMs, and the resulting bottleneck hinders the development of novel therapeutics.

The bottleneck for AI in drug development isn't the sophistication of the models but the absence of large-scale, high-quality biological datasets. Without comprehensive data on how drugs interact within complex human systems, even the best AI models cannot make accurate predictions.

Following the success of AlphaFold in predicting protein structures, Demis Hassabis says DeepMind's next grand challenge is creating a full AI simulation of a working cell. This "virtual cell" would allow researchers to test hypotheses about drugs and diseases millions of times faster than in a physical lab.