While OpenFold trains on public datasets, the pre-processing and distillation needed to make that data usable require massive compute resources. This "data prep" phase can cost over $15 million, creating a significant, non-obvious barrier to entry for academic labs and startups wanting to build foundational models.
The industry has already exhausted the public web data used to train foundational AI models, a point underscored by the phrase "we've already run out of data." The next leap in AI capability and business value will come from harnessing the vast, proprietary data currently locked behind corporate firewalls.
The "Bitter Lesson" is not just about using more compute, but leveraging it scalably. Current LLMs are inefficient because they only learn during a discrete training phase, not during deployment where most computation occurs. This reliance on a special, data-intensive training period is not a scalable use of computational resources.
Open source AI models can't improve in the same decentralized way as software like Linux. While the community can fine-tune and optimize, the primary driver of capability—massive-scale pre-training—requires centralized compute resources that are inherently better suited to commercial funding models.
Public internet data has been largely exhausted for training AI models. The real competitive advantage and source for next-generation, specialized AI will be the vast, untapped reservoirs of proprietary data locked inside corporations, like R&D data from pharmaceutical or semiconductor companies.
Creating frontier AI models is incredibly expensive, yet their value depreciates rapidly as lower-cost open-source alternatives quickly replicate their capabilities. This forces model providers to evolve into more defensible application companies to survive.
Even under optimistic HSBC projections of massive revenue growth by 2030, OpenAI faces a $207 billion funding shortfall on its data center and compute commitments. This staggering gap indicates that the current business model is not viable at scale and that OpenAI will need to either renegotiate those massive contracts or find an entirely new monetization strategy.
For years, access to compute was the primary bottleneck in AI development. Now that public web data is largely exhausted, the limiting factor is access to high-quality, proprietary data from enterprises and human experts. This shifts the focus from building massive infrastructure to securing data partnerships and domain expertise.
According to Stanford's Fei-Fei Li, the central challenge facing academic AI isn't the rise of closed, proprietary models. The more pressing issue is a severe imbalance in resources, particularly compute, which cripples academia's ability to conduct its unique mission of foundational, exploratory research.
When LLMs became too computationally expensive for universities, AI research pivoted. Academics flocked to areas like 3D vision, where breakthroughs like NeRF allowed for state-of-the-art results on a single GPU. This resource constraint created a vibrant, accessible, and innovative research ecosystem away from giant models.
Paying a single AI researcher millions is rational when they're running experiments on compute clusters worth tens of billions. A researcher with the right intuition can prevent wasting billions on failed training runs, making their high salary a rounding error compared to the capital they leverage.
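A rough back-of-envelope calculation makes the "rounding error" claim concrete. All dollar figures below are assumed for illustration only; none come from the text.

```python
# Hypothetical back-of-envelope: every figure here is an assumption.
researcher_comp = 10e6    # assumed: $10M/year total compensation
cluster_capex = 30e9      # assumed: value of the compute cluster the researcher directs
failed_run_cost = 1e9     # assumed: cost of one avoidable failed training run

salary_share = researcher_comp / cluster_capex          # ~0.03% of the capital leveraged
break_even_fraction = researcher_comp / failed_run_cost  # share of one failed run avoided

print(f"Salary as share of cluster capex: {salary_share:.3%}")
print(f"Fraction of one $1B failed run needed to justify the salary: {break_even_fraction:.0%}")
```

Under these assumed numbers, avoiding even one percent of a single billion-dollar failed training run covers the researcher's compensation for the year.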