Cuban identifies a massive, overlooked opportunity: acquiring the intellectual property (patents, data, designs) of millions of defunct businesses. This "dead IP" could be aggregated and sold at a premium to foundation model companies desperate for unique training data.
The industry has already exhausted the public web data used to train foundational AI models, a point underscored by the phrase "we've already run out of data." The next leap in AI capability and business value will come from harnessing the vast, proprietary data currently locked behind corporate firewalls.
Public internet data has been largely exhausted for training AI models. The real competitive advantage and source for next-generation, specialized AI will be the vast, untapped reservoirs of proprietary data locked inside corporations, like R&D data from pharmaceutical or semiconductor companies.
Mark Cuban warns that patenting work requires publishing it, which lets any AI model train on it immediately. To maintain a competitive data advantage, he suggests companies rely increasingly on trade secrets, keeping their valuable IP out of public view and away from competitors' models.
Instead of building AI models, a company can create immense value by being 'AI adjacent'. The strategy is to focus on enabling good AI by solving the foundational 'garbage in, garbage out' problem. Providing high-quality, complete, and well-understood data is a critical and defensible niche in the AI value chain.
Since LLMs are commodities, sustainable competitive advantage in AI comes from leveraging proprietary data and unique business processes that competitors cannot replicate. Companies must focus on building AI that understands their specific "secret sauce."
The AI revolution may favor incumbents, not just startups. Large companies possess vast, proprietary datasets. If they quickly fine-tune custom LLMs with this data, they can build a formidable competitive moat that an AI startup, starting from scratch, cannot easily replicate.
To build a unique dataset without massive cost, target the aggregated, non-identifiable 'exhaust data' from software, payments, and telematics companies. These firms often undervalue this data, and some have simply been deleting it, so they might provide it cheaply or exclusively.
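As a rough illustration of what "aggregated, non-identifiable" could mean in practice, the sketch below rolls raw event records up into group totals and drops any group with too few distinct users. The field names (`user_id`, `region`, `category`, `amount`) and the threshold of 3 are assumptions made for this example, not details from the source, and small-group suppression alone reduces re-identification risk without guaranteeing anonymity.

```python
from collections import defaultdict

# Assumed threshold: groups with fewer distinct users than this are suppressed.
MIN_GROUP_SIZE = 3


def aggregate_exhaust(records, min_group_size=MIN_GROUP_SIZE):
    """Aggregate per-event records into (region, category) totals.

    Each record is a dict with hypothetical keys:
    'user_id', 'region', 'category', 'amount'.
    User identifiers are used only to count distinct users and never
    appear in the output.
    """
    totals = defaultdict(float)
    users = defaultdict(set)

    for record in records:
        key = (record["region"], record["category"])
        totals[key] += record["amount"]
        users[key].add(record["user_id"])

    aggregated = {}
    for key, total in totals.items():
        distinct_users = len(users[key])
        # Drop small groups, where a total could point back to one customer.
        if distinct_users >= min_group_size:
            aggregated[key] = {
                "distinct_users": distinct_users,
                "total_amount": round(total, 2),
            }
    return aggregated


# Made-up sample data, for illustration only.
sample_records = [
    {"user_id": "u1", "region": "west", "category": "fuel", "amount": 40.0},
    {"user_id": "u2", "region": "west", "category": "fuel", "amount": 35.5},
    {"user_id": "u3", "region": "west", "category": "fuel", "amount": 20.0},
    {"user_id": "u1", "region": "west", "category": "fuel", "amount": 10.0},
    {"user_id": "u4", "region": "east", "category": "fuel", "amount": 50.0},
]

summary = aggregate_exhaust(sample_records)
print(summary)
```

With this sample, the "west"/"fuel" group is kept because it covers three distinct users, while the "east"/"fuel" group is suppressed because it covers only one.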
As algorithms become more widespread, the key differentiator for leading AI labs is their exclusive access to vast, private datasets. xAI has Twitter, Google has YouTube, and OpenAI has user conversations, creating unique training advantages that are nearly impossible for others to replicate.
The initial AI boom was fueled by scraping the public internet. Cuban predicts the next phase will be dominated by exclusive data deals. Content owners, like medical journals, will protect their IP and auction it to the highest-bidding AI companies, creating valuable data silos.
YipitData had data on millions of companies but could only afford to process it for a few hundred public tickers due to high manual cleaning costs. AI and LLMs have now made it economically viable to tag and structure this messy, long-tail data at scale, creating massive new product opportunities.