AI models trained on descriptive data (e.g., RNA-seq) can classify cell states but fail to predict how to transition a diseased cell to a healthy one. True progress requires generating massive "causal" datasets that show the effects of specific genetic perturbations.
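
As a toy illustration of that gap (synthetic data only; not any published method): a classifier trained on expression profiles can label a cell as diseased or healthy, but only a model trained on perturbation-response pairs, such as those from a CRISPR knockout screen, can suggest which intervention moves a cell between states.

```python
# Toy sketch contrasting descriptive and causal training data.
# All data is synthetic; gene and perturbation indices are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
n_cells, n_genes, n_perturbations = 500, 50, 10

# Descriptive data: expression profiles plus a disease/healthy label.
expression = rng.normal(size=(n_cells, n_genes))
state = (expression[:, 0] > 0).astype(int)  # 1 = "diseased", 0 = "healthy"
classifier = LogisticRegression(max_iter=1000).fit(expression, state)
print("state classification accuracy:", classifier.score(expression, state))
# The classifier recognizes the diseased state but says nothing about
# which intervention flips it. That requires causal data: pairs of
# (perturbation applied, expression change observed).
perturbation = rng.integers(0, n_perturbations, size=n_cells)
perturbation_onehot = np.eye(n_perturbations)[perturbation]
true_effects = rng.normal(size=(n_perturbations, n_genes))
delta_expression = true_effects[perturbation] + 0.1 * rng.normal(size=(n_cells, n_genes))

effect_model = Ridge().fit(perturbation_onehot, delta_expression)
# Ask which perturbation is predicted to push the "disease axis" (gene 0) down.
predicted = effect_model.predict(np.eye(n_perturbations))
print("best candidate perturbation:", int(np.argmin(predicted[:, 0])))
```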

Related Insights

For AI in protein engineering to evolve from pattern matching to an understanding of the underlying physics, structural data alone is insufficient. Models need physical parameters such as the Gibbs free energy of binding (ΔG), obtainable from affinity measurements, to become truly predictive and transformative for therapeutic development.
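
For concreteness, that quantity follows directly from a measured dissociation constant via the standard thermodynamic relation ΔG° = RT ln(Kd/c0), with c0 = 1 M as the reference state. The sketch below uses illustrative Kd values, not measured data.

```python
# Converting a measured binding affinity (dissociation constant, Kd)
# into a standard Gibbs free energy of binding: dG = R * T * ln(Kd / c0).
import math

R = 8.314  # gas constant, J/(mol*K)

def delta_g_from_kd(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy in kJ/mol from a Kd given in mol/L."""
    return R * temperature_k * math.log(kd_molar) / 1000.0

for kd in (1e-6, 1e-9, 1e-12):  # micromolar, nanomolar, picomolar binders
    print(f"Kd = {kd:.0e} M  ->  dG = {delta_g_from_kd(kd):6.1f} kJ/mol")
# Tighter binding (smaller Kd) gives a more negative dG; values like these
# are exactly the physical labels an affinity screen can supply to a model.
```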

Simple cell viability screens fail to identify powerful drug combinations where each component is ineffective on its own. AI can predict these synergies, but only if trained on mechanistic data that reveals how cells rewire their internal pathways in response to a drug.
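
One common way to quantify that kind of synergy is the Bliss independence model, which compares the observed combination effect with what independent action would predict. The sketch below uses illustrative effect fractions, not experimental measurements; Bliss is only one of several reference models (Loewe additivity is another).

```python
# Why single-agent viability screens miss synergy, via Bliss independence.
# Effect values are fractions of cells killed and are purely illustrative.

def bliss_expected(effect_a: float, effect_b: float) -> float:
    """Expected combined effect if drugs A and B act independently."""
    return effect_a + effect_b - effect_a * effect_b

def bliss_excess(effect_a: float, effect_b: float, observed_ab: float) -> float:
    """Positive values indicate synergy beyond independent action."""
    return observed_ab - bliss_expected(effect_a, effect_b)

# Each drug alone barely affects viability (5% kill), so a single-agent
# screen ranks both as inactive...
effect_a, effect_b = 0.05, 0.05
# ...yet the combination kills 80% of cells, e.g. because drug A blocks
# the resistance pathway cells use to escape drug B.
observed_combo = 0.80

print("expected if independent:", round(bliss_expected(effect_a, effect_b), 3))
print("Bliss excess (synergy):", round(bliss_excess(effect_a, effect_b, observed_combo), 3))
```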

While AI promises to design therapeutics computationally, it does not eliminate the need for physical lab work. Even if future models were to need no additional training data, their predicted outputs would still have to be experimentally validated, sustaining a continuous, inescapable cycle in which high-throughput data generation remains critical for progress.

The primary barrier to AI in drug discovery is the lack of large, high-quality training datasets. The emergence of federated learning platforms, which train models collectively while raw data stays protected at each institution, is a critical and underappreciated development for advancing the field.
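
As a rough sketch of the underlying idea, federated averaging (FedAvg) trains local models on data that never leaves each institution and pools only the resulting weights. The example below uses synthetic data and plain linear regression, and is not tied to any particular platform; real systems add secure aggregation, differential privacy, and far richer models.

```python
# Minimal federated averaging sketch: each site trains locally on private
# data and shares only model weights, which a server averages each round.
import numpy as np

rng = np.random.default_rng(1)
true_w = rng.normal(size=8)

def make_site_data(n=200):
    """Private dataset held by one institution (never shared)."""
    X = rng.normal(size=(n, 8))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

sites = [make_site_data() for _ in range(5)]
global_w = np.zeros(8)

for _ in range(20):                              # communication rounds
    local_updates = []
    for X, y in sites:
        w = global_w.copy()
        for _ in range(10):                      # local gradient steps
            grad = X.T @ (X @ w - y) / len(y)
            w -= 0.1 * grad
        local_updates.append(w)                  # only weights leave the site
    global_w = np.mean(local_updates, axis=0)    # server-side averaging

print("error vs. true weights:", float(np.linalg.norm(global_w - true_w)))
```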

Despite AI's power, roughly 90% of drug candidates fail in clinical trials. John Jumper argues the bottleneck isn't finding molecules that bind their target proteins, but our fundamental lack of understanding of disease causality, as with Alzheimer's: a biology problem, not a technology one.

Progress in using AI to predict cancer treatment response is stalled not by the algorithms but by the data used to train them. Relying solely on static genetic data is insufficient; the critical missing piece is functional, contextual data showing how patient cells actually respond to drugs.

Current AI for protein engineering relies on small public datasets such as the PDB (on the order of 200,000 structures), causing models to "hallucinate" or default to known examples. This data bottleneck, orders of magnitude smaller than the corpora used to train LLMs, hinders the development of novel therapeutics.

The next frontier in preclinical research involves feeding multi-omics and spatial data from complex 3D cell models into AI algorithms. This synergy will enable a crucial shift from merely observing biological phenomena to accurately predicting therapeutic outcomes and patient responses.

The bottleneck for AI in drug development isn't the sophistication of the models but the absence of large-scale, high-quality biological data sets. Without comprehensive data on how drugs interact within complex human systems, even the best AI models cannot make accurate predictions.

A major frustration in genetics is finding 'variants of unknown significance' (VUS): genetic variants whose effect on health cannot yet be interpreted. AI models promise to simulate the impact of these unique variants on cellular function, moving medicine from reactive diagnostics to truly personalized, predictive health.