To truly understand biological systems, data scale is less important than data quality. The most informative data comes from capturing the dynamic interactions of a system *while* it's being perturbed (e.g., by a drug), not from static snapshots of a system at rest.
Foundational biological datasets, like the first Human Cell Atlas, take immense time and capital to create; the first atlas took roughly 10 years. However, that initial effort produces tooling and knowledge that allow subsequent, larger-scale projects to be completed exponentially faster and at a fraction of the cost.
The primary bottleneck for creating powerful foundation models in biology is the lack of clean, large-scale experimental data—orders of magnitude less than what's available for LLMs. This creates a major opportunity for "data foundries" that use robotic labs to generate high-quality biological data at scale.
AI models trained on descriptive data (e.g., RNA-seq) can classify cell states but fail to predict how to transition a diseased cell to a healthy one. True progress requires generating massive "causal" datasets that show the effects of specific genetic perturbations.
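The descriptive-versus-causal distinction above can be made concrete with a toy sketch. All data and names here are hypothetical: the point is that static snapshots only support labeling cell states, while (state before, perturbation, state after) triples let a model estimate the effect of an intervention and predict a transition.

```python
# Descriptive snapshots: expression profiles with state labels.
# These support classification, but say nothing about how to move
# a cell from one state to another.
snapshots = {
    "diseased": [[5.0, 1.0], [5.2, 0.9]],
    "healthy":  [[1.0, 4.0], [0.8, 4.1]],
}

# Causal data: (state_before, perturbation, state_after) triples
# from hypothetical perturbation experiments.
perturbation_runs = [
    ([5.0, 1.0], "knockdown_geneX", [1.1, 3.9]),
    ([5.2, 0.9], "knockdown_geneX", [0.9, 4.2]),
]

def learn_effect(runs, perturbation):
    """Average per-gene expression shift caused by a perturbation."""
    deltas = [
        [a - b for b, a in zip(before, after)]
        for before, p, after in runs
        if p == perturbation
    ]
    n = len(deltas)
    return [sum(col) / n for col in zip(*deltas)]

def predict_transition(state, effect):
    """Predict the post-perturbation state of a new cell."""
    return [s + e for s, e in zip(state, effect)]

effect = learn_effect(perturbation_runs, "knockdown_geneX")

# Only the causal dataset lets us ask: what happens if we perturb this cell?
predicted = predict_transition([5.1, 1.0], effect)
print(predicted)
```

A real model would be far richer, but the asymmetry is the same: no amount of snapshot data yields the `effect` vector, because that quantity only exists in data collected while the system is being perturbed.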
While AI excels where large, clean datasets exist (like protein folding), it struggles with modeling slow, progressive diseases like Alzheimer's or obesity. These are organ-level phenomena, and the necessary data doesn't exist yet. In vivo platforms are critical for generating this required foundational data.
Early researchers were overwhelmed by the massive, chaotic changes in gene expression in sepsis patients, terming it a "genomic storm." Inflammatics' founders viewed this complexity not as an obstacle but as a rich dataset. By applying advanced computational analysis, they identified specific, interpretable signals for diagnosis and prognosis.
The progress of AI in predicting cancer treatment is stalled not by algorithms, but by the data used to train them. Relying solely on static genetic data is insufficient. The critical missing piece is functional, contextual data showing how patient cells actually respond to drugs.
The next frontier in preclinical research involves feeding multi-omics and spatial data from complex 3D cell models into AI algorithms. This synergy will enable a crucial shift from merely observing biological phenomena to accurately predicting therapeutic outcomes and patient responses.
The bottleneck for AI in drug development isn't the sophistication of the models but the absence of large-scale, high-quality biological datasets. Without comprehensive data on how drugs interact within complex human systems, even the best AI models cannot make accurate predictions.
Applying AI to biology isn't just a big data problem. The training data must be structured for reinforcement learning. This means it must be complete (including negative results) and allow for a feedback loop where AI predictions are tested in the lab, and the results are used to refine the model.
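The feedback loop described above can be sketched as a minimal lab-in-the-loop cycle. Everything here is hypothetical (the "lab" is a stand-in oracle, the model is a crude similarity score): the structure to note is that every result, including negatives, goes back into the training set that guides the next round of experiments.

```python
import random

random.seed(0)

def run_lab_experiment(compound):
    """Stand-in for a real assay: returns True if the compound 'works'.
    The modulo rule is an arbitrary hidden ground truth for this toy."""
    return compound % 7 == 0

def model_score(compound, known_hits):
    """Crude model: score untested compounds by proximity to known hits;
    with no hits yet, explore at random."""
    if not known_hits:
        return random.random()
    return -min(abs(compound - h) for h in known_hits)

candidates = list(range(1, 50))
training_data = {}  # compound -> observed result; negatives are kept too

for cycle in range(5):
    known_hits = [c for c, hit in training_data.items() if hit]
    untested = [c for c in candidates if c not in training_data]
    # Model proposes the most promising batch of experiments...
    batch = sorted(
        untested,
        key=lambda c: model_score(c, known_hits),
        reverse=True,
    )[:5]
    # ...the lab runs them, and all outcomes feed the next iteration.
    for c in batch:
        training_data[c] = run_lab_experiment(c)

hits = sorted(c for c, hit in training_data.items() if hit)
print(f"tested {len(training_data)} compounds, hits: {hits}")
```

The design choice that matters is recording failures alongside successes: without the negative results, the model cannot distinguish "untested" from "tested and inactive", and the loop degenerates into re-proposing known dead ends.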
Biomarkers provide value beyond predicting patient response. Their core function is to answer 'why' a treatment succeeded or failed. This explanatory power informs sequential therapy decisions and provides crucial scientific insights that advance the entire medical field, not just the individual patient's case.