To overcome the data scarcity problem for industrial AI, Siemens formed an alliance with competing German machine builders. These companies agreed to pool their operational data, trusting Siemens to build shared AI models more powerful than any single company could create alone.
Rather than trying to predict specific geopolitical crises, Siemens builds resilience by creating separate technology stacks for different regions. For instance, its industrial AI for China is trained on Chinese LLMs, while its US counterpart uses American models, creating independent and compliant systems.
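To make that pattern concrete, here is a minimal sketch of region-scoped model routing under the stated assumption that each region gets its own endpoint and data store; every identifier below is a hypothetical placeholder, not Siemens' actual configuration.

```python
# Minimal sketch of region-separated stacks: each region points at its own model
# endpoint and data store, with no cross-region fallback. All names and URLs
# here are hypothetical placeholders, not a real deployment.
from dataclasses import dataclass

@dataclass(frozen=True)
class RegionalStack:
    llm_endpoint: str   # region-local inference endpoint
    data_store: str     # training data stays inside this jurisdiction

REGIONAL_STACKS = {
    "cn": RegionalStack("https://cn.example.internal/llm", "cn-plant-telemetry"),
    "us": RegionalStack("https://us.example.internal/llm", "us-plant-telemetry"),
}

def resolve_stack(region: str) -> RegionalStack:
    """Return the stack for a region; an unknown region raises rather than
    silently falling back to another jurisdiction's models or data."""
    return REGIONAL_STACKS[region]
```

The deliberate absence of a default route is the point of the design: the stacks stay independent, so each remains compliant on its own terms.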
Public internet data has been largely exhausted for training AI models. The real competitive advantage and source for next-generation, specialized AI will be the vast, untapped reservoirs of proprietary data locked inside corporations, like R&D data from pharmaceutical or semiconductor companies.
The primary barrier to AI in drug discovery is the lack of large, high-quality training datasets. The emergence of federated learning platforms, which keep raw data private while collectively training shared models, is a critical and underappreciated development for advancing the field.
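As a rough illustration of what such platforms do under the hood, the sketch below follows the federated-averaging pattern: each site trains on its own private data and shares only model weights with a coordinator, which averages them. The toy linear model, synthetic data, and site sizes are invented for the example and are not any specific platform's implementation.

```python
# Minimal federated-averaging sketch: raw data never leaves a site; only
# locally updated weights are shared and averaged. Toy linear model and
# synthetic data, for illustration only.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's gradient steps on its private least-squares objective."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, site_data):
    """Each site trains locally; the coordinator averages weights by site size."""
    updates, sizes = [], []
    for X, y in site_data:               # (X, y) stays on the site
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(np.stack(updates), axis=0, weights=np.array(sizes, float))

# Three "sites" with private data; ten rounds recover the shared signal.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for n in (40, 60, 80):
    X = rng.normal(size=(n, 2))
    sites.append((X, X @ true_w + rng.normal(scale=0.1, size=n)))

w = np.zeros(2)
for _ in range(10):
    w = federated_round(w, sites)
print(w)  # close to true_w, with no site exposing its raw records
```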
The future of valuable AI lies not in models trained on the abundant public internet, but in those built on scarce, proprietary data. For fields like robotics and biology, this data doesn't exist to be scraped; it must be actively created, making the data generation process itself the key competitive moat.
Before deploying AI across a business, companies must first harmonize data definitions, especially after mergers. When different business units define a "raw lead" differently, AI models cannot function reliably. This foundational data work is a critical prerequisite for moving beyond proofs-of-concept to scalable AI solutions.
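As a hedged illustration of what that harmonization step looks like in practice, the sketch below maps two units' differing "raw lead" labels onto one canonical schema before any model consumes the records; all field names and stage values are invented for the example.

```python
# Sketch of post-merger schema harmonization: two business units label the same
# "raw lead" concept with different field names and stage values, so records
# are translated into one canonical schema before training or inference.
# Every field name and value here is a hypothetical placeholder.
CANONICAL_STAGE = "raw_lead"

UNIT_SCHEMAS = {
    "unit_a": {"id_field": "lead_id", "stage_field": "status",
               "raw_values": {"new", "unqualified"}},
    "unit_b": {"id_field": "contact_ref", "stage_field": "funnel_stage",
               "raw_values": {"raw", "inbound"}},
}

def harmonize(record: dict, unit: str) -> dict:
    """Translate a unit-local record into the shared schema used downstream."""
    schema = UNIT_SCHEMAS[unit]
    stage = record[schema["stage_field"]].lower()
    return {
        "lead_id": record[schema["id_field"]],
        "stage": CANONICAL_STAGE if stage in schema["raw_values"] else stage,
        "source_unit": unit,
    }

print(harmonize({"lead_id": "A-17", "status": "New"}, "unit_a"))
print(harmonize({"contact_ref": "B-42", "funnel_stage": "inbound"}, "unit_b"))
```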
Roland Busch asserts that foundational LLMs alone are insufficient, and even dangerous, for industrial applications because of their unreliability. He argues that reaching the required 95%+ accuracy depends on augmenting these models with highly specific, proprietary data from machines, operations, and past fixes.
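One common way to realize this kind of augmentation is retrieval: pull the relevant proprietary records into the prompt so the model answers from evidence rather than from memory. The sketch below is a generic illustration, not Siemens' actual pipeline; the maintenance records are invented, the keyword scoring is a stand-in for learned retrieval, and `call_llm` represents whichever foundation model is in use.

```python
# Retrieval-augmented prompting sketch: ground the model's answer in
# proprietary "past fixes" instead of letting it guess. All records and the
# call_llm hook below are hypothetical placeholders.
def score(query: str, doc: str) -> int:
    """Crude keyword overlap; a production system would use learned embeddings."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

PAST_FIXES = [  # illustrative proprietary maintenance knowledge base
    "Spindle vibration above 4 mm/s on line 3: replace worn bearing, torque to spec.",
    "Conveyor stall after E-stop: clear fault code 217, re-home axis before restart.",
]

def build_prompt(question: str, top_k: int = 1) -> str:
    """Select the most relevant past fixes and constrain the model to them."""
    context = sorted(PAST_FIXES, key=lambda d: score(question, d), reverse=True)[:top_k]
    return ("Answer using ONLY the maintenance records below.\n\n"
            + "\n".join(f"- {c}" for c in context)
            + f"\n\nQuestion: {question}")

prompt = build_prompt("Line 3 spindle vibration is high, what is the fix?")
# response = call_llm(prompt)  # hypothetical call to the chosen foundation model
print(prompt)
```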
As AI's bottleneck shifts from compute to data, the key advantage becomes low-cost data collection. Industrial incumbents enjoy a built-in moat: they can source messy, multimodal data from their existing operations, something startups cannot replicate without paying a steep marginal cost for every data point.
Major AI labs like OpenAI and Anthropic are partnering with competing cloud and chip providers (Amazon, Google, Microsoft). This creates a complex web of alliances where rivals become partners, spreading risk and ensuring access to the best available technology, regardless of primary corporate allegiances.
According to Salesforce's AI chief, the primary challenge for large companies deploying AI is harmonizing data across siloed departments, like sales and marketing. AI cannot operate effectively without connected, unified data, making data integration the crucial first step before any advanced AI implementation.
Contrary to early narratives, a proprietary dataset is not the primary moat for AI applications. True, lasting defensibility is built by deeply integrating into an industry's ecosystem—connecting different stakeholders, leveraging strategic partnerships, and using funding velocity to build the broadest product suite.