We scan new podcasts and send you the top 5 insights daily.
Demonstrating extreme conviction, Noetik invested a year and a half in lab setup, tumor sourcing, and data processing before having a dataset large enough to train its first models. This highlights the immense upfront investment and risk required for a data-first approach in bio-AI, where no off-the-shelf data exists.
Foundational biological datasets take immense time and capital to create; the first Human Cell Atlas took roughly a decade. However, this initial effort creates tooling and knowledge that allow subsequent, larger-scale projects to be completed exponentially faster and at a fraction of the cost.
The bottleneck for AI in drug discovery is not the algorithm but the lack of high-quality, large-scale biological data. New platforms are needed to generate this necessary "substrate" for AI models to learn from, challenging the narrative that better models alone are the solution.
The next leap in biotech moves beyond applying AI to existing data. CZI pioneers a model where 'frontier biology' and 'frontier AI' are developed in tandem. Experiments are now designed specifically to generate novel data that will ground and improve future AI models, creating a virtuous feedback loop.
The primary bottleneck for creating powerful foundation models in biology is the lack of clean, large-scale experimental data—orders of magnitude less than what's available for LLMs. This creates a major opportunity for "data foundries" that use robotic labs to generate high-quality biological data at scale.
To break the data bottleneck in AI protein engineering, companies now generate massive synthetic datasets. By creating novel "synthetic epitopes" and measuring their binding, they can produce thousands of validated positive and negative training examples in a single experiment, massively accelerating model development.
Building the first large-scale biological datasets, like the Human Cell Atlas, is a decade-long, expensive slog. However, this foundational work creates tools and knowledge that enable subsequent, larger-scale projects to be completed exponentially faster and cheaper, proving a non-linear path to discovery.
Xaira's core strategy involves creating massive, proprietary datasets that reveal causal biology. By systematically perturbing every gene in a cell to observe its effects, they generate unique training data for their models, quadrupling the world's supply of such information with a single publication.
The future of valuable AI lies not in models trained on the abundant public internet, but in those built on scarce, proprietary data. For fields like robotics and biology, this data doesn't exist to be scraped; it must be actively created, making the data generation process itself the key competitive moat.
A new 'Tech Bio' model inverts traditional biotech by first building a novel, highly structured database designed for AI analysis. Only after this computational foundation is built do they use it to identify therapeutic targets, creating a data-first moat before any lab work begins.
Unlike general AI, which leverages vast existing datasets, Noetik believes progress in biology requires designing and generating specific, high-quality data with foresight into the models that will be trained on it. They compare this to the intentional, decades-long creation of the PDB dataset that enabled protein-folding models.