Customizing a base model with proprietary data is effective only if a company holds a massive corpus: at least 10 billion high-quality tokens *after* aggressive deduplication and filtering. That bar is much higher than most businesses realize, and it makes the strategy viable only for the largest corporations.
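To make the "after deduplication and filtering" caveat concrete, here is a minimal sketch of how a team might estimate its usable corpus size. Only the 10-billion-token figure comes from the text; the function names, the `min_words` filter, and the use of whitespace-separated words as a stand-in for tokenizer tokens are illustrative assumptions, not a standard pipeline.

```python
import hashlib

# Threshold quoted in the text; every other name and value here is an assumption.
MIN_TOKENS = 10_000_000_000


def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(doc.lower().split())


def usable_token_count(docs, min_words=5):
    """Count words in documents that survive a length filter and exact dedup.

    Whitespace-separated words are only a rough proxy for tokenizer tokens.
    """
    seen = set()
    total = 0
    for doc in docs:
        text = normalize(doc)
        words = text.split()
        if len(words) < min_words:  # crude quality filter: drop short fragments
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        total += len(words)
    return total


def meets_threshold(docs, threshold=MIN_TOKENS):
    """Return True if the filtered, deduplicated corpus reaches the threshold."""
    return usable_token_count(docs) >= threshold
```

Real pipelines typically add near-duplicate detection and a proper tokenizer, both of which tend to shrink the usable count further, so a raw document count is a poor guide to whether a corpus clears the bar.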
The industry has already exhausted the public web data used to train foundational AI models, a point underscored by the phrase "we've already run out of data." The next leap in AI capability and business value will come from harnessing the vast, proprietary data currently locked behind corporate firewalls.
For specialized, high-stakes tasks like insurance underwriting, enterprises will favor smaller, on-prem models fine-tuned on proprietary data. These models can be faster, more accurate, and more secure than general-purpose frontier models, creating a lasting market for custom AI solutions.
The notion of building a business as a 'thin wrapper' around a foundational model like GPT is flawed. Truly defensible AI products, like Cursor, build numerous specific, fine-tuned models to deeply understand a user's domain. This creates a data and performance moat that a generic model cannot easily replicate, much like Salesforce was more than just a 'thin wrapper' on a database.
For years, access to compute was the primary bottleneck in AI development. Now that public web data is largely exhausted, the limiting factor is access to high-quality, proprietary data from enterprises and human experts. This shifts the focus from building massive infrastructure to forming data partnerships and securing access to expertise.
Off-the-shelf AI models can only go so far. The true bottleneck for enterprise adoption is "digitizing judgment": capturing the unique, context-specific expertise of the company's own employees. The same document can mean something entirely different from one company to the next, so the labeling has to be done internally.
Basic supervised fine-tuning (SFT) only adjusts a model's style. The real unlock for enterprises is reinforcement fine-tuning (RFT), which leverages proprietary datasets to create state-of-the-art models for specific, high-value tasks, moving beyond mere 'tone improvements.'
A critical learning at LinkedIn was that pointing an AI at an entire company drive for context results in poor performance and hallucinations. The team had to manually curate "golden examples" and specific knowledge bases to train agents effectively, as the AI couldn't discern quality on its own.
If a company and its competitor both ask a generic LLM for strategy, they'll get the same answer, erasing any edge. The only way to generate unique, defensible strategies is by building evolving models trained on a company's own private data.
As algorithms become more widespread, the key differentiator for leading AI labs is their exclusive access to vast, private datasets. xAI has Twitter, Google has YouTube, and OpenAI has user conversations, creating unique training advantages that are nearly impossible for others to replicate.
Anthropic maintains a competitive edge by physically acquiring and digitizing thousands of old books, creating a massive, proprietary dataset of high-quality text. This multi-year effort to build a unique data library is difficult to replicate and may contribute to the distinct quality of its Claude models.