Unlike general AI which leverages vast, existing datasets, Noetik believes progress in biology requires designing and generating specific, high-quality data with foresight into the models that will be trained. They compare this to the intentional, decades-long creation of the PDB dataset for protein folding.
Noetik's core thesis is that the 95% failure rate in cancer trials isn't due to bad drug design, but an inability to identify the correct patient sub-population. Their models aim to solve this patient selection problem from the outset, rescuing potentially effective drugs.
Drawing an analogy from neuroscience, Noetik argues for a top-down modeling approach. Instead of building a perfect simulation of a single cell and scaling up, they model the functional interactions at the tissue level first. This abstraction is more likely to predict patient-level outcomes, which is the ultimate goal.
Instead of pursuing a purely academic goal of simulating every biochemical process, Noetik's "virtual cell" models are practical tools. They focus on understanding cell biology through heuristics that are useful for making drugs, like predicting a cell's transcriptome or protein expression in a specific context.
A major cause of clinical trial failure is that preclinical testing uses immortalized cancer cell lines cultured for decades. These cells have abnormal genomes and gene expressions that don't represent actual tumors, creating a massive translational gap that Noetik's patient-derived data aims to solve.
To bridge the gap between animal models and human trials, Noetik trains models on its human data and then runs inference on mouse histology (H&E) images. This allows them to predict human-relevant biology and gene expression directly from the mouse model, overcoming a key translational hurdle in drug development.
Noetik's $50M deal with GSK licenses their OctoVC foundation model, not a drug candidate or a collaborative project. This shifts the business model from bespoke services to a scalable software-like approach, allowing pharma partners to use the model across their entire pipeline and even fine-tune it on proprietary data.
To mitigate data variations caused by running experiments on different days (batch effects), Noetik employs a sophisticated arraying strategy. They take dozens of samples from a single tumor and distribute them across multiple, randomized arrays, ensuring each patient is represented in different batches for robust calibration and model training.
While Noetik's models are trained on complex, multimodal data like spatial transcriptomics, they are designed to run inference using only standard, ubiquitous H&E pathology slides. This creates a highly scalable and practical path to a clinical diagnostic without requiring expensive, novel assays for every patient.
When developing their Tario transformer model, Noetik discovered a key scaling behavior: larger, autoregressive models only outperform smaller ones when given a longer context window (i.e., seeing more tissue at once). This suggests that capturing broader spatial relationships is critical for learning complex biological patterns.
Demonstrating extreme conviction, Noetik invested a year and a half in lab setup, tumor sourcing, and data processing before having a dataset large enough to train its first models. This highlights the immense upfront investment and risk required for a data-first approach in bio-AI, where no off-the-shelf data exists.
