To create a predictive "virtual cell," data collection must shift from passive observation to active intervention. The strategy is to scale perturbation experiments (such as Perturb-seq) across many cellular contexts and measure multi-modal responses, so that the model learns cause and effect rather than correlation.
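A toy simulation can make the observational-versus-interventional distinction concrete. The sketch below is purely illustrative and rests on assumptions that do not come from the podcast: a hypothetical hidden cell state drives both a hypothetical "gene A" and a phenotype, while gene A itself has no causal effect. The names `simulate_cell` and `effect_estimate` are invented for this example.

```python
import random
from statistics import mean

def simulate_cell(rng, knockdown=None):
    """One toy cell: a hidden state drives both gene A and the phenotype."""
    state = rng.random() < 0.5           # hidden confounder (assumed, e.g. a cell state)
    gene_a = 1.0 if state else 0.0       # observed expression simply tracks the hidden state
    if knockdown is not None:
        gene_a = knockdown               # a perturbation overrides natural expression
    # By construction, gene A has zero causal effect on the phenotype.
    phenotype = 2.0 * state + 0.0 * gene_a + rng.gauss(0, 0.1)
    return gene_a, phenotype

def effect_estimate(cells):
    """Mean phenotype of high-A cells minus mean phenotype of low-A cells."""
    high = [y for a, y in cells if a == 1.0]
    low = [y for a, y in cells if a == 0.0]
    return mean(high) - mean(low)

rng = random.Random(0)

# Passive observation: gene A is whatever the cell happens to express.
observational = [simulate_cell(rng) for _ in range(2000)]

# Active intervention: gene A is set by the experimenter, independent of cell state.
interventional = [simulate_cell(rng, knockdown=float(i % 2)) for i in range(2000)]

obs_effect = effect_estimate(observational)
int_effect = effect_estimate(interventional)
print(f"observational estimate:  {obs_effect:.2f}")
print(f"interventional estimate: {int_effect:.2f}")
```

In this toy setup the observational estimate suggests a large effect of gene A, because high-A cells are also the cells in the phenotype-driving state. The interventional estimate is close to zero, which matches how the simulation was built. Real perturbation screens are far noisier and higher-dimensional, but the same logic is why interventional data can teach a model cause and effect.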

Related Insights

A convergence of DNA sequencing, CRISPR, and AI allows scientists to move beyond just understanding biology to actively intervening. Medicine is now programming cellular behavior by rewriting DNA, representing a "step function" leap in what's achievable for treating disease at its root cause.

The next leap in biotech moves beyond applying AI to existing data. CZI pioneers a model where 'frontier biology' and 'frontier AI' are developed in tandem. Experiments are now designed specifically to generate novel data that will ground and improve future AI models, creating a virtuous feedback loop.

Today's "virtual cell" models represent training data well but cannot predict outcomes for novel interventions. The next frontier is building models that generalize to serve as true predictive oracles for experiments that haven't yet been performed, a key focus for BioHub.

Xaira's core strategy involves creating massive, proprietary datasets that reveal causal biology. By systematically perturbing every gene in a cell to observe its effects, they generate unique training data for their models, quadrupling the world's supply of such information with a single publication.

AI models trained on descriptive data (e.g., RNA-seq) can classify cell states but fail to predict how to transition a diseased cell to a healthy one. True progress requires generating massive "causal" datasets that show the effects of specific genetic perturbations.

The primary obstacle to creating sophisticated AI models of cells isn't the AI itself, but the data. Existing datasets often perturb only one cellular variable at a time, failing to capture the complex interactions that arise from simultaneous changes. New platforms are needed to generate this multi-dimensional data.

The next frontier in preclinical research involves feeding multi-omics and spatial data from complex 3D cell models into AI algorithms. This synergy will enable a crucial shift from merely observing biological phenomena to accurately predicting therapeutic outcomes and patient responses.

To truly understand biological systems, data scale is less important than data quality. The most informative data comes from capturing the dynamic interactions of a system *while* it's being perturbed (e.g., by a drug), not from static snapshots of a system at rest.

Building biologically relevant AI is not a one-off process. It demands a continuous "lab in the loop" system where wet lab experiments generate proprietary data to train models, whose outputs are then physically tested in the lab. This iterative feedback cycle constantly refines the model's predictive accuracy.

While petabytes of observational DNA sequence data exist, they are insufficient for the next wave of AI. The key to creating powerful, functional models is generating causal data from experiments that systematically test function, and that data is currently the bottleneck.

Building Predictive Cell Models Requires Scaling Interventional, Not Just Observational, Data | RiffOn