We scan new podcasts and send you the top 5 insights daily.
The internet is an insufficient training ground for scientific AI because most crucial information—including failed experiments, negative data, and nuanced procedural details—is never published. This undocumented knowledge, what scientists call "good hands," represents a major data bottleneck for building truly intelligent scientific models.
Even the most advanced AI model can't accelerate science without practical, real-world data. The current bottleneck is often logistical—knowing reagent lead times, lab inventory, and costs. Superior model intelligence is less critical than having access to this operational context.
Unlike language models trained on the internet, AI for materials science overcomes data scarcity and unreliability (e.g., conflicting literature) with a closed loop. The system actively directs experiments, analyzes grounded results for patterns, and uses that new data to drive the next cycle.
Public internet data has been largely exhausted for training AI models. The real competitive advantage and source for next-generation, specialized AI will be the vast, untapped reservoirs of proprietary data locked inside corporations, like R&D data from pharmaceutical or semiconductor companies.
The primary bottleneck for creating powerful foundation models in biology is the lack of clean, large-scale experimental data—orders of magnitude less than what's available for LLMs. This creates a major opportunity for "data foundries" that use robotic labs to generate high-quality biological data at scale.
Foundation models can't be trained for physics using existing literature because the data is too noisy and lacks published negative results. A physical lab is needed to generate clean data and capture the learning signal from failed experiments, which is a core thesis for Periodic Labs.
AI performs poorly in areas where expertise is based on unwritten 'taste' or intuition rather than documented knowledge. If the correct approach doesn't exist in training data or isn't explicitly provided by human trainers, models will inevitably struggle with that particular problem.
Early AI models advanced by scraping web text and code. The next revolution, especially in "AI for science," requires overcoming a major hurdle: consolidating and formatting the world's vast but fragmented scientific data across disciplines like chemistry and materials science for model training.
Current AI for protein engineering relies on small public datasets like the PDB (~10,000 structures), causing models to "hallucinate" or default to known examples. This data bottleneck, orders of magnitude smaller than data used for LLMs, hinders the development of novel therapeutics.
Applying AI to biology isn't just a big data problem. The training data must be structured for reinforcement learning. This means it must be complete (including negative results) and allow for a feedback loop where AI predictions are tested in the lab, and the results are used to refine the model.
The founder of AI and robotics firm Medra argues that scientific progress is not limited by a lack of ideas or AI-generated hypotheses. Instead, the critical constraint is the physical capacity to test these ideas and generate high-quality data to train better AI models.