Achieving state-of-the-art AI performance requires a massive, bespoke data generation process. Thousands of human experts, from legal specialists to management consultants, write task-specific examples, grading rubrics, and chain-of-thought explanations. This work has become a new and rapidly growing data industry, and it is the true engine of progress.
While training AI is vastly less data-efficient than training a human, it remains a winning economic strategy. Unlike human learning, AI training can be massively parallelized, and the resulting skills can be amortized across billions of simultaneous user sessions, which makes an inefficient process both profitable and scalable.
The rapid progress of open-source models is evidence that data, not proprietary architectures or training tricks, is the primary driver of AI capability. Competitors can distill data from public APIs and quickly close the gap with frontier models. That would be impossible if secret architectural tricks were the main advantage.
According to scaling laws, increasing model size offers minimal improvement to data efficiency. Even an infinitely large model would only reduce data needs by about 10x, a trivial amount compared to the thousands-to-millions-fold efficiency gap between AIs and humans. This suggests current architectures are on the wrong scaling curve for true intelligence.
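The bounded benefit of model size can be illustrated with a rough calculation. The sketch below assumes the Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta with the constants reported by Hoffmann et al. (2022). The text above does not say which scaling law it has in mind, so both the functional form and the constants are assumptions, as is the 70B-parameter, 1.4T-token reference point. It asks how many tokens an infinitely large model (where the A/N^alpha term vanishes) would need to match the loss of the finite model.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Constants follow the fit reported by Hoffmann et al. (2022); treat them as
# an assumption, since the text does not name a specific scaling law.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_needed_infinite_model(target_loss: float) -> float:
    """Tokens an infinitely large model needs to reach target_loss.

    With N -> infinity the A / N**alpha term vanishes, so we solve
    E + B / D**beta = target_loss for D.
    """
    return (B / (target_loss - E)) ** (1.0 / BETA)


# Illustrative reference point (an assumption, not taken from the text).
n_params, n_tokens = 70e9, 1.4e12

target = loss(n_params, n_tokens)
tokens_inf = tokens_needed_infinite_model(target)
data_saving = n_tokens / tokens_inf

print(f"loss at reference point:        {target:.3f}")
print(f"tokens needed by infinite model: {tokens_inf:.3g}")
print(f"data saving from infinite size:  {data_saving:.1f}x")
```

Under these assumptions the saving is a single-digit multiple, the same order of magnitude as the roughly 10x figure above, and far short of a thousand-fold or larger gap. The exact factor depends on the chosen reference point and fitted constants.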
The argument that evolution 'pre-trained' humans, excusing AI's data needs, is flawed. The human genome is too small to store a complex neural network's parameters. A better analogy is that evolution found the right hyperparameters and loss functions, while our brain's 'weights' are learned from scratch in our lifetime, making AI's data hunger even more stark.
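The genome-size point can be checked with back-of-the-envelope arithmetic. The sketch below assumes a genome of about 3.1 billion base pairs at 2 bits per base, and a brain with on the order of 10^14 synapses stored at one byte each. The synapse count and the one-byte-per-weight encoding are coarse illustrative assumptions, not figures from the text.

```python
# Information capacity of the genome (upper bound, ignoring redundancy).
BASE_PAIRS = 3.1e9      # approximate length of the human genome
BITS_PER_BASE = 2       # four possible bases -> 2 bits each
genome_bytes = BASE_PAIRS * BITS_PER_BASE / 8

# Storage needed for the brain's "weights" under coarse assumptions.
SYNAPSES = 1e14         # assumed order of magnitude for synapse count
BYTES_PER_SYNAPSE = 1   # assumed: one byte per synaptic weight
weights_bytes = SYNAPSES * BYTES_PER_SYNAPSE

shortfall = weights_bytes / genome_bytes

print(f"genome capacity:  {genome_bytes / 1e6:.0f} MB")
print(f"synaptic weights: {weights_bytes / 1e12:.0f} TB")
print(f"shortfall factor: {shortfall:.3g}x")
```

Even with these generous simplifications, the genome comes out several orders of magnitude too small to encode the weights directly, which is consistent with the view that it specifies something closer to hyperparameters and loss functions.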
