The public database of protein structures (PDB) is small and grows slowly. To train more powerful models, Genesis uses physics simulations of small-molecule behavior to build a large, high-quality synthetic dataset, an approach that is not feasible for more complex protein-protein interactions.
Structural data alone is insufficient to move AI for protein engineering from pattern matching to an understanding of physics. Models need physical parameters like Gibbs free energy (delta-G), obtainable from affinity measurements, to become truly predictive and transformative for therapeutic development.
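As a rough illustration of how an affinity measurement maps to delta-G, the standard thermodynamic relation delta-G = RT ln(Kd) can be sketched in a few lines. This assumes Kd is expressed in molar units against a 1 M standard state; the function name and the default temperature are illustrative choices, not anything described in the episode.

```python
import math

# Gas constant in kcal/(mol*K)
R_KCAL_PER_MOL_K = 1.987204e-3

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Return the binding free energy (kcal/mol) for a dissociation constant.

    Uses delta-G = R * T * ln(Kd / 1 M). A smaller Kd (tighter binding)
    gives a more negative delta-G. Hypothetical helper for illustration.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)

# A 1 nM binder versus a 1 uM binder at room temperature
dg_nanomolar = binding_free_energy(1e-9)
dg_micromolar = binding_free_energy(1e-6)
print(f"1 nM: {dg_nanomolar:.2f} kcal/mol, 1 uM: {dg_micromolar:.2f} kcal/mol")
```

Each tenfold improvement in Kd is worth about 1.4 kcal/mol at room temperature, which is why affinity measurements give a model a quantitative physical signal rather than a binary bind/no-bind label.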
Moving beyond simulation, Genesis uses a cycle where their AI proposes molecules, a pharma partner synthesizes and tests them in a wet lab, and the experimental outcomes are used as feedback to retrain the generative model. This is akin to RLHF but with physical experiments.
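The propose, test, retrain cycle described above can be sketched as a toy loop. This is not Genesis's method: the "generative model" here is just a Gaussian over a single number, the "wet lab" is a hidden scoring function, and "retraining" moves the Gaussian toward the best-scoring candidates (a simplified cross-entropy-method-style update). All names are hypothetical.

```python
import random

def wet_lab_assay(x):
    """Stand-in for the partner's experiment: a hidden score that peaks at x = 3.0."""
    return -(x - 3.0) ** 2

def run_design_make_test_loop(rounds=20, batch_size=16, seed=0):
    """Toy closed loop: propose candidates, score them, update the proposer."""
    rng = random.Random(seed)
    mean, spread = 0.0, 1.0  # toy "generative model": a Gaussian over one number
    for _ in range(rounds):
        # 1. The model proposes a batch of candidates.
        proposals = [rng.gauss(mean, spread) for _ in range(batch_size)]
        # 2. The "lab" synthesizes and tests them.
        results = sorted(((wet_lab_assay(x), x) for x in proposals), reverse=True)
        # 3. Experimental feedback: keep the best quarter of the batch.
        top = [x for _, x in results[: batch_size // 4]]
        # 4. "Retrain" by re-centering the proposer on what worked.
        mean = sum(top) / len(top)
    return mean

final_mean = run_design_make_test_loop()
print(f"proposer mean after feedback: {final_mean:.2f}")
```

The model never sees the scoring function directly; it improves only through measured outcomes, which is the sense in which the cycle resembles RLHF with experiments in place of human raters.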
Similar to how an LLM uses a 'chain of thought' to reason, Genesis's model 'thinks' by iteratively refining an in-memory representation of a crystal structure. This process is guided by physics-based principles, significantly improving the final prediction's accuracy.
The primary bottleneck for creating powerful foundation models in biology is the lack of clean, large-scale experimental data—orders of magnitude less than what's available for LLMs. This creates a major opportunity for "data foundries" that use robotic labs to generate high-quality biological data at scale.
To break the data bottleneck in AI protein engineering, companies now generate massive synthetic datasets. By creating novel "synthetic epitopes" and measuring their binding, they can produce thousands of validated positive and negative training examples in a single experiment, massively accelerating model development.
Current AI for protein engineering relies on small public datasets like the PDB (~10,000 structures), causing models to "hallucinate" or default to known examples. These datasets are orders of magnitude smaller than those used for LLMs, and the resulting data bottleneck hinders the development of novel therapeutics.
Static data scraped from the web is becoming less central to AI training. The new frontier is "dynamic data," where models learn through trial-and-error in synthetic environments (like solving math problems), effectively creating their own training material via reinforcement learning.
ProPhet's strategy is to focus on 'hard-to-drug' proteins, which are often avoided because they lack the structural data required for traditional discovery. Because ProPhet's AI model needs very little protein information to predict interactions, this data scarcity becomes a competitive advantage.
Noetik believes that, unlike general AI, which leverages vast existing datasets, progress in biology requires designing and generating specific, high-quality data with foresight into the models that will be trained on it. They compare this to the intentional, decades-long creation of the PDB dataset for protein folding.
Generative AI alone designs proteins that look correct on paper but often fail in the lab. DenovAI adds a physics layer to simulate molecular dynamics—the "jiggling and wiggling"—which weeds out false positives by modeling how proteins actually interact in the real world.