A new market has emerged where defunct startups sell their entire operational histories—including codebases, internal communications, and go-to-market data—to AI labs and data brokers. This creates a new form of salvage value, turning years of failed effort into a valuable corpus for training next-generation models.
The industry has already exhausted the public web data used to train foundation models; as one speaker put it, "we've already run out of data." The next leap in AI capability and business value will come from harnessing the vast proprietary data currently locked behind corporate firewalls.
Public internet data has been largely exhausted for training AI models. The real competitive advantage and source for next-generation, specialized AI will be the vast, untapped reservoirs of proprietary data locked inside corporations, like R&D data from pharmaceutical or semiconductor companies.
As powerful AI models make synthesizing public information trivial, the value of that data diminishes. AI platform RowSpace's thesis is that a firm's only defensible advantage lies in its decades of private data, accumulated judgment, and institutional memory. Its product is built to unlock this internal alpha.
Turing operates in two markets: providing AI services to enterprises and training data to frontier labs. Serving enterprises reveals where models break in practice (e.g., reading multi-page PDFs). This knowledge allows Turing to create targeted, valuable datasets to sell back to the model creators, creating a powerful feedback loop.
Cuban identifies a massive, overlooked opportunity: acquiring the intellectual property (patents, data, designs) from millions of defunct businesses. This "dead IP" could be aggregated and sold at a high premium to foundational model companies desperate for unique training data.
Ambitious AI projects may fail their primary goal but still produce valuable secondary assets. An attempt to predict memory prices with an LLM failed, but the automated data gathering process created a first-of-its-kind historical analysis dashboard, which proved to be a more valuable outcome.
With public data exhausted, AI companies are seeking proprietary datasets. After being rejected by established firms wary of sharing their "crown jewels," these labs are now acquiring the codebases of failed startups for tens of thousands of dollars as a novel source of high-quality training data.
As AI commoditizes software creation, the primary source of sustainable value shifts from the software itself to the unique, high-quality data that AI agents use for decision-making. Businesses must re-center their strategy around data as the core asset.
The initial AI boom was fueled by scraping the public internet. Cuban predicts the next phase will be dominated by exclusive data deals. Content owners, like medical journals, will protect their IP and auction it to the highest-bidding AI companies, creating valuable data silos.
Haystack's "Big Token" thesis posits that large AI foundation-model companies (like OpenAI) will acquire startups not for their applications, but for their unique, proprietary datasets ("tokens"). This mirrors the Big Pharma model of buying smaller biotech firms for their R&D and drug assets.