A major misconception is that general-purpose Large Language Models (LLMs) can be readily applied to complex biological problems. Biological data, like RNA sequencing, constitutes a unique language that requires custom-built foundation models, not simply fine-tuning of existing LLMs.
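One concrete way to see expression data as a "language": rank-value encodings (used by published models such as Geneformer) turn each cell's expression profile into an ordered sequence of gene tokens. The sketch below is a toy illustration of that idea, with invented genes and counts, not any model's actual pipeline.

```python
# Toy sketch: turning one cell's RNA-seq profile into a token sequence
# by ranking genes from most to least expressed (the general idea behind
# rank-value encodings such as Geneformer's; this is not that pipeline).

# Hypothetical expression counts for a single cell (made-up values).
expression = {"GAPDH": 1200, "CD3E": 85, "FOXP3": 40, "IL2RA": 310, "ACTB": 980}

# Rank genes by expression; the ordered gene symbols become the "sentence".
ranked_genes = sorted(expression, key=expression.get, reverse=True)

# Map each gene symbol to an integer token id from a fixed vocabulary.
vocab = {gene: idx for idx, gene in enumerate(sorted(expression))}
token_ids = [vocab[gene] for gene in ranked_genes]

print(ranked_genes)  # ['GAPDH', 'ACTB', 'IL2RA', 'CD3E', 'FOXP3']
print(token_ids)     # the sequence a transformer would actually consume
```

The point of the encoding is that grammar-like structure (which genes tend to "lead" a cell's profile) becomes learnable by the same architectures that learn word order, without pretending expression data is English.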
Unlike the data behind image recognition or NLP, clinical trial data has a unique and complex mathematical geometry. According to Dr. Juraji, this means generic AI models are insufficient: solving trial failures requires specialized AI built to navigate this specific, difficult data landscape.
Powerful AI models for biology exist, but the industry lacks a breakthrough user interface—a "ChatGPT for science"—that makes them accessible, trustworthy, and integrated into wet lab scientists' workflows. This adoption and translation problem is the biggest hurdle, not the raw capability of the AI models themselves.
A classical, bottom-up simulation of a cell is infeasible, according to John Jumper. He sees the more practical path forward as fusing specialized models like AlphaFold with the broad reasoning of LLMs to create hybrid systems that understand biology.
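As a rough illustration of the hybrid pattern, here is a minimal sketch in which a general reasoner delegates a narrow subproblem to a specialist model. Both `predict_structure` and `reason` are stubs standing in for systems like AlphaFold and an LLM; every name, value, and interface here is invented.

```python
# Toy sketch of a hybrid system: a general reasoner delegates narrow,
# well-defined subproblems to specialist models (here all stubbed out).

def predict_structure(sequence: str) -> dict:
    """Stand-in for a specialist like AlphaFold (hypothetical interface)."""
    return {"sequence": sequence, "plddt": 0.87, "fold": "beta-barrel"}

def reason(question: str, evidence: dict) -> str:
    """Stand-in for an LLM's broad reasoning over the specialist's output."""
    return (f"Given a predicted {evidence['fold']} fold "
            f"(confidence {evidence['plddt']:.2f}), {question}")

# The hybrid loop: the reasoner decides a structure is needed, calls the
# specialist, then folds the result back into its answer.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
answer = reason("is this protein likely membrane-associated?",
                predict_structure(seq))
print(answer)
```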
The next major AI breakthrough will come from applying generative models to complex systems beyond human language, such as biology. By treating biological processes as a unique "language," AI could discover novel therapeutics or research paths, leading to a "Move 37" moment in science.
The primary bottleneck for creating powerful foundation models in biology is the lack of clean, large-scale experimental data—orders of magnitude less than what's available for LLMs. This creates a major opportunity for "data foundries" that use robotic labs to generate high-quality biological data at scale.
Unlike math or code, where rewards are cheap and fast to verify, clinically valuable biology problems lack easily verifiable ground truths. This makes it difficult to create the rapid reinforcement learning loops that drive explosive AI progress in other fields.
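A toy contrast makes the asymmetry concrete: a code reward can be computed by simply running the candidate, while a biology "reward" has nothing to execute. Both functions below are illustrative stand-ins, not real infrastructure.

```python
# Toy contrast: verifying a code policy is instant, while a biology
# "reward" only exists after a physical experiment. Illustrative only.

def reward_for_code(candidate_fn) -> float:
    """Cheap, fast, exact: run the candidate against a known test."""
    return 1.0 if candidate_fn(2, 3) == 5 else 0.0

def reward_for_biology(candidate_molecule: str) -> str:
    """There is nothing to execute; the signal must come from a wet-lab
    assay days or weeks later, so all we can do is queue the experiment."""
    return f"queued assay for {candidate_molecule}; reward pending"

print(reward_for_code(lambda a, b: a + b))  # 1.0, available immediately
print(reward_for_biology("compound-X17"))   # no RL update possible yet
```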
Applying AI to biology isn't just a big-data problem: the training data must be structured for reinforcement learning. That means it must be complete (including negative results) and support a feedback loop in which AI predictions are tested in the lab and the results are used to refine the model.
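Here is a minimal sketch of such a lab-in-the-loop cycle, with every component stubbed out: the proposer, the assay, and the retraining step are random or print-only stand-ins, invented purely to show the shape of the loop.

```python
# Toy lab-in-the-loop cycle: the model proposes, the lab measures, and
# BOTH hits and misses are logged and fed back into training.
import random

dataset = []  # accumulates (candidate, measured_activity) pairs

def propose(n: int) -> list[str]:
    """Stand-in for a model ranking candidate designs."""
    return [f"design-{random.randint(0, 999)}" for _ in range(n)]

def run_assay(candidate: str) -> float:
    """Stand-in for a wet-lab measurement."""
    return random.random()

def retrain(data: list) -> None:
    """Stand-in for a gradient update on the accumulated results."""
    print(f"retraining on {len(data)} examples "
          f"({sum(1 for _, y in data if y < 0.5)} negative)")

for cycle in range(3):
    for candidate in propose(4):
        result = run_assay(candidate)
        dataset.append((candidate, result))  # negatives are kept, not discarded
    retrain(dataset)
```

Note that the dataset grows with every cycle rather than being filtered to successes; discarding the negatives would break exactly the feedback signal the insight describes.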
CZI acknowledges the power of Large Language Models (LLMs) for linear biological data such as protein sequences, but its strategy recognizes that biological processes are highly multidimensional and non-linear. The organization is focused on developing new kinds of AI that can accurately model this complexity, moving beyond the one-dimensional, sequential nature of language-based models.
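To make the one-dimensional vs. multidimensional contrast concrete, the toy sketch below compares a flat token sequence with a small gene-regulatory graph. The gene names are real, but the edges, activity values, and update rule are invented for illustration.

```python
# Toy contrast: a language model consumes a 1-D token sequence, while a
# gene-regulatory network is a graph whose structure a sequence flattens
# away. Edge weights and activities here are invented.
import numpy as np

# 1-D view: just an ordered list of tokens.
sequence = ["TP53", "MDM2", "CDKN1A", "BAX"]

# Multidimensional view: the same genes plus who regulates whom.
genes = sequence
adjacency = np.array([  # adjacency[i][j] = 1 if gene i regulates gene j
    [0, 1, 1, 1],       # TP53 -> MDM2, CDKN1A, BAX
    [1, 0, 0, 0],       # MDM2 -> TP53 (a feedback loop, invisible in 1-D)
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])

state = np.array([1.0, 0.2, 0.1, 0.1])     # activity per gene
next_state = np.tanh(adjacency.T @ state)  # one non-linear propagation step
print(dict(zip(genes, next_state.round(3))))
```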
While petabytes of observational DNA sequence data exist, it's insufficient for the next wave of AI. The key to creating powerful, functional models is generating causal data—from experiments that systematically test function—which is a current data bottleneck.
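The difference is easy to see at the record level: an observational entry captures what co-occurs, while a perturbational entry captures an intervention and its effect. The schemas below are hypothetical, with field names invented for illustration.

```python
# Toy schemas: an observational record says what co-occurs; a causal
# record says what happened when we intervened. Field names are made up.
from dataclasses import dataclass

@dataclass
class ObservationalRecord:       # abundant: just sequenced, never perturbed
    genome_id: str
    variant: str
    phenotype: str               # correlation only

@dataclass
class PerturbationRecord:        # scarce: the intervention is recorded too
    target_gene: str
    intervention: str            # e.g. "CRISPR knockout"
    readout_before: float
    readout_after: float         # supports a causal claim about function

obs = ObservationalRecord("g001", "BRCA1:c.68_69del", "disease present")
exp = PerturbationRecord("BRCA1", "CRISPR knockout", 1.00, 0.12)
print(obs, exp, sep="\n")
```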
Traditional science failed to create equations for complex biological systems because biology is too "bespoke." AI succeeds by discerning patterns from vast datasets, effectively serving as the "language" for modeling biology, much like mathematics is the language of physics.