Current AI for protein engineering relies on small public datasets such as the Protein Data Bank (roughly 200,000 experimentally determined structures), causing models to "hallucinate" or default to known examples. This training data is orders of magnitude smaller than the corpora used to train LLMs, and the resulting bottleneck hinders the development of novel therapeutics.
To break the data bottleneck in AI protein engineering, companies now generate large purpose-built datasets. By designing novel "synthetic epitopes" and measuring their binding, they can produce thousands of validated positive and negative training examples in a single experiment, sharply accelerating model development.
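A minimal sketch of how a single experiment's binding readout might be turned into labeled positive and negative training examples. The record type, field names, 1 µM binder cutoff, and example sequences are illustrative assumptions, not any specific company's data pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical record: a designed binder/epitope sequence paired with a
# measured dissociation constant (Kd, in molar). Field names are illustrative.
@dataclass
class BindingMeasurement:
    sequence: str
    kd_molar: Optional[float]  # None when no binding was detected

def label_examples(
    measurements: List[BindingMeasurement],
    binder_threshold: float = 1e-6,  # assumption: tighter than 1 uM counts as a binder
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Split one experiment's measurements into positive and negative
    training examples using a simple Kd cutoff."""
    positives, negatives = [], []
    for m in measurements:
        if m.kd_molar is not None and m.kd_molar <= binder_threshold:
            positives.append((m.sequence, 1))
        else:
            negatives.append((m.sequence, 0))
    return positives, negatives

# Example: two designs screened in the same experiment.
batch = [
    BindingMeasurement("ACDEFGHIK", 5e-9),  # tight binder -> positive example
    BindingMeasurement("LMNPQRSTV", None),  # no detectable binding -> negative example
]
pos, neg = label_examples(batch)
print(len(pos), "positive,", len(neg), "negative")
```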
Structural data alone cannot move AI for protein engineering from pattern matching to an understanding of the underlying physics. Models also need physical parameters such as the Gibbs free energy of binding (ΔG), which can be derived from affinity measurements, to become truly predictive and transformative for therapeutic development.
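To illustrate that link, a dissociation constant from an affinity measurement maps directly onto a standard binding free energy via ΔG = RT ln(Kd). The short Python sketch below shows the conversion; the function name, temperature, and example Kd are illustrative assumptions, not a description of any particular platform.

```python
import math

R_KCAL_PER_MOL_K = 1.987e-3  # gas constant in kcal/(mol*K)

def delta_g_from_kd(kd_molar: float, temp_kelvin: float = 298.15) -> float:
    """Standard binding free energy (kcal/mol) from a dissociation constant.

    Uses delta-G = R * T * ln(Kd), with Kd in molar units relative to a
    1 M standard state; tighter binders give more negative values.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temp_kelvin * math.log(kd_molar)

# Example: a 1 nM binder corresponds to roughly -12.3 kcal/mol at 25 C.
print(f"delta-G = {delta_g_from_kd(1e-9):.2f} kcal/mol")
```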
Tackling monumental challenges, such as creating a biologic effective against 800+ HIV variants, does not happen in a single shot. It requires repeated iterations on an advanced engineering platform: each cycle of design, measurement, and learning progressively refines the molecule, making previously unattainable therapeutic goals achievable.
Biotech companies create more value by de-risking molecules for clinical success than by engineering them from scratch. Specialized platforms can create those molecules faster and more reliably, freeing developers to apply their core competency to advancing de-risked assets through the pipeline.
