The rumored acquisition of Pinterest by OpenAI is reportedly driven by Pinterest's 200 billion user-tagged images, a 'goldmine' for AI training. This demonstrates that large, well-structured datasets are becoming critical strategic assets and key drivers of M&A activity in the AI sector.

Related Insights

The industry has already exhausted the public web data used to train foundational AI models, a point captured by the observation that "we've already run out of data." The next leap in AI capability and business value will come from harnessing the vast, proprietary data currently locked behind corporate firewalls.

Strategic deals between OpenAI and giants like Amazon and Microsoft are moves to embed the AI leader within those companies' ecosystems. This is evidenced by agreements committing OpenAI to the partners' proprietary processors and cloud infrastructure, securing technological dependency.

Cuban identifies a massive, overlooked opportunity: acquiring the intellectual property (patents, data, designs) from millions of defunct businesses. This "dead IP" could be aggregated and sold at a high premium to foundational model companies desperate for unique training data.

For years, access to compute was the primary bottleneck in AI development. Now, with public web data largely exhausted, the limiting factor is access to high-quality, proprietary data from enterprises and human experts. This shifts the focus from building massive infrastructure to forming data partnerships and securing access to expertise.

With public data exhausted, AI companies are seeking proprietary datasets. After being rejected by established firms wary of sharing their 'crown jewels,' these labs are now acquiring the codebases of failed startups for tens of thousands of dollars as a novel source of high-quality training data.

Point-solution SaaS products are at a massive disadvantage in the age of AI because they lack the broad, integrated dataset needed to power effective features. Bundled platforms that 'own the mine' of data are best positioned to win, as AI can perform magic when it has access to a rich, semantic data layer.

As algorithms become more widespread, the key differentiator for leading AI labs is their exclusive access to vast, private datasets. xAI has Twitter, Google has YouTube, and OpenAI has user conversations, creating unique training advantages that are nearly impossible for others to replicate.

The initial AI boom was fueled by scraping the public internet. Cuban predicts the next phase will be dominated by exclusive data deals. Content owners, like medical journals, will protect their IP and auction it to the highest-bidding AI companies, creating valuable data silos.

YipitData had data on millions of companies but could only afford to process it for a few hundred public tickers due to high manual cleaning costs. AI and LLMs have now made it economically viable to tag and structure this messy, long-tail data at scale, creating massive new product opportunities.
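The economics described above hinge on replacing manual data cleaning with automated tagging. As a minimal sketch of what such a pipeline might look like (not YipitData's actual system), the snippet below uses a simple rule-based `tag_record` function as a stand-in for what would, in practice, be an LLM API call; the pipeline shape, where every messy record flows through an automated tagger into a structured row, is the part that AI made economically viable at long-tail scale.

```python
# Hypothetical sketch of an LLM-assisted tagging pipeline for messy,
# long-tail business records. tag_record is a rule-based placeholder
# for an LLM call that would extract structure from free-form text.

def tag_record(raw: str) -> dict:
    """Stand-in for an LLM call: map messy text to a structured row."""
    text = raw.lower()
    sector = "unknown"
    if "coffee" in text or "restaurant" in text:
        sector = "food_and_beverage"
    elif "software" in text or "saas" in text:
        sector = "software"
    return {"raw": raw, "sector": sector}

def structure_dataset(records: list[str]) -> list[dict]:
    """Tag every record; automating this step is what collapses the
    per-record cleaning cost that once limited coverage to a few
    hundred public tickers."""
    return [tag_record(r) for r in records]

if __name__ == "__main__":
    messy = [
        "Joe's Coffee LLC - downtown location, card receipts",
        "AcmeSoft: B2B SaaS invoicing, ~40 employees",
    ]
    for row in structure_dataset(messy):
        print(row["sector"], "<-", row["raw"])
```

Swapping the placeholder for a real model call leaves the pipeline unchanged; only the cost and accuracy of `tag_record` differ.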

OpenAI's move into healthcare is not just about applying LLMs to medicine. By acquiring Torch, it is tackling the core problem of fragmented health data. Torch was built as a "context engine" to unify scattered records, creating the comprehensive dataset needed for AI to provide meaningful health insights.