Unlike protein folding, which benefited from the CASP competition's experimental ground truth data, materials science lacks large-scale, high-quality experimental datasets. Existing data often comes from low-fidelity simulations, meaning even the best AI models are trained on imperfect information, hindering a major breakthrough.
Even the most advanced AI model can't accelerate science without practical, real-world data. The current bottleneck is often logistical—knowing reagent lead times, lab inventory, and costs. Superior model intelligence is less critical than having access to this operational context.
The traditional scientific method in materials science—hypothesize, experiment, learn—is being replaced. AI enables a new paradigm: treating the vast space of all possible molecules as a searchable database. Scientists can now query for materials with desired properties, radically accelerating discovery.
The primary bottleneck for creating powerful foundation models in biology is the lack of clean, large-scale experimental data—orders of magnitude less than what's available for LLMs. This creates a major opportunity for "data foundries" that use robotic labs to generate high-quality biological data at scale.
D. E. Shaw Research (DESRES) invested heavily in custom silicon for molecular dynamics (MD) to solve protein folding. In contrast, DeepMind's AlphaFold, using ML on experimental data, solved it on commodity hardware. This demonstrates that data-driven approaches can be vastly more effective than brute-force simulation for complex scientific problems.
Foundation models for physics can't be trained on the existing literature alone: the published data is too noisy, and negative results are rarely published at all. A physical lab is needed to generate clean data and capture the learning signal from failed experiments, which is a core thesis for Periodic Labs.
Models like AlphaFold don't solve protein folding from physics alone. They heavily rely on co-evolutionary data, where correlated mutations across species provide strong hints about which amino acids are physically close. This dramatically constrains the search space for the final structure.
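To make the co-evolution idea concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information, a classic and much simpler covariation measure. It is not how AlphaFold works (AlphaFold feeds the alignment into a neural network); the toy alignment and the function names are invented for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(msa, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


def rank_pairs(msa):
    """Score every pair of positions and sort from most to least correlated."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(column(msa, i), column(msa, j))
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy alignment: one row per species, one column per residue position.
# Positions 1 and 3 always mutate together (K with E, R with D).
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "GKVE",
    "GRVD",
]

ranked = rank_pairs(toy_msa)
for (i, j), score in ranked[:2]:
    print(f"positions {i} and {j}: MI = {score:.3f} bits")
```

In this toy data the pair (1, 3) ranks first because the two positions change in lockstep, which is the kind of hint that suggests two residues may be in physical contact. Real pipelines use far deeper alignments and correct for phylogenetic bias and indirect correlations, which this sketch ignores.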
Early AI models advanced by scraping web text and code. The next revolution, especially in "AI for science," requires overcoming a major hurdle: consolidating and formatting the world's vast but fragmented scientific data across disciplines like chemistry and materials science for model training.
Despite significant hype, new "foundation models" for materials science may not be ready to replace traditional physics-based methods. In practice, one prominent model was only five times faster than existing GPU-accelerated calculations and proved unreliable, with molecules nonsensically falling apart, highlighting the need for more rigorous evaluation.
Current AI for protein engineering relies on small public datasets like the PDB (on the order of 200,000 structures), causing models to "hallucinate" or default to known examples. This data bottleneck, orders of magnitude smaller than data used for LLMs, hinders the development of novel therapeutics.
The bottleneck for AI in drug development isn't the sophistication of the models but the absence of large-scale, high-quality biological datasets. Without comprehensive data on how drugs interact within complex human systems, even the best AI models cannot make accurate predictions.