The demand for AI inference is insatiable. As models become cheaper and more efficient, developers and businesses find more ways to embed intelligence, creating a perpetually growing market. Even with AGI, the core need will be running inference.
Despite the hype around enterprise AI, the vast majority of current inference workloads are driven by new, AI-native application companies. This indicates that the broader enterprise adoption market is still in its infancy, representing a massive future growth opportunity.
Instead of selling directly to enterprises initially, AI infrastructure companies can learn enterprise needs by proxy. By serving fast-moving AI startups who sell to the enterprise, they receive a "translation" of requirements for data retention, latency, and transparency, preparing them for that market.
At scale, companies rarely deploy open-source models "off the shelf." Instead, virtually all production workloads involve custom modifications: post-training on proprietary data to improve quality, or compiling and quantizing the model to raise throughput and cut cost.
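To make the quantization step concrete, here is a minimal sketch using PyTorch's post-training dynamic quantization. The toy two-layer model is a hypothetical stand-in for an open-source checkpoint; production pipelines typically use heavier compiler and quantizer toolchains, but the principle is the same: shrink weights after training, without retraining.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an open-source checkpoint; any nn.Module
# built from Linear layers quantizes the same way.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)
model.eval()

# Post-training dynamic quantization: weights are stored as int8 and
# activations are quantized on the fly. No retraining required; memory
# footprint drops roughly 4x for the quantized layers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # torch.Size([1, 4096])
```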
Startups can compete with large AI labs by capturing unique user interaction data from specialized workflows. This proprietary "user signal" enables post-training of models for specific tasks, creating a defensible advantage that labs, lacking that specific context, cannot easily replicate.
The first step for an AI startup is to prove value using the best off-the-shelf models, even if they are expensive. Investing in custom models and post-training is a form of optimization that should only happen after product-market fit is established and there is a clear user signal to optimize for.
Providing GPUs-as-a-Service is not a durable business because customers can easily switch providers. The key to customer retention and high net dollar retention (NDR) is the software layer built on top of the hardware. This software, which handles the complexities of inference, creates the actual stickiness.
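For readers unfamiliar with the metric, NDR measures how much revenue an existing customer cohort generates a year later, expansion included and churn netted out; the dollar figures below are hypothetical, chosen only to show why a sticky software layer pushes the number above 100%.

```python
def net_dollar_retention(starting_arr: float, expansion: float,
                         contraction: float, churn: float) -> float:
    """Standard NDR: revenue retained from an existing cohort over a
    period, including expansion, net of downgrades and churn."""
    return (starting_arr + expansion - contraction - churn) / starting_arr

# Hypothetical cohort: $10M ARR at the start of the year, $3.5M of
# expansion driven by the software layer, $0.5M of downgrades, $1M churned.
ndr = net_dollar_retention(10e6, 3.5e6, 0.5e6, 1e6)
print(f"{ndr:.0%}")  # 120% -> the cohort spends more over time despite churn
```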
The widely discussed GPU supply crunch is only half the problem. There's a severe shortage of suppliers who can operate data centers with the high reliability and SLAs required for mission-critical inference. Out of many providers, only a handful meet the "gold tier" for operational excellence.
When running AI inference at extreme scale, the most surprising and difficult challenges are often not unique to LLMs. Instead, they are classic distributed systems problems, such as kernel panics triggered by logging overload, that only manifest under immense load. The relative immaturity of AI serving runtimes compounds these issues.
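The logging failure mode is a good illustration: a hot error path that logs once per request can saturate disk or syslog long before the model itself is the bottleneck. A minimal mitigation sketch, assuming Python's standard logging module (this is an illustrative pattern, not the provider's actual fix), is to enforce a per-second log budget:

```python
import logging
import time

class RateLimitFilter(logging.Filter):
    """Drop log records beyond a per-second budget so a hot error path
    cannot flood disk or syslog under extreme request volume."""

    def __init__(self, max_per_sec: int = 100):
        super().__init__()
        self.max_per_sec = max_per_sec
        self._window = 0   # current one-second window (unix time)
        self._count = 0    # records emitted in the current window

    def filter(self, record: logging.LogRecord) -> bool:
        now = int(time.time())
        if now != self._window:
            self._window, self._count = now, 0
        self._count += 1
        return self._count <= self.max_per_sec

logger = logging.getLogger("inference")
logger.addFilter(RateLimitFilter(max_per_sec=100))
```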
Accessing next-generation GPUs at scale is no longer a simple purchase. The market now demands three-to-five-year commitments with a significant portion (20-30%) of the total contract value paid upfront. This makes a company's cost of capital a critical competitive factor in acquiring compute capacity.
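A back-of-the-envelope calculation shows why cost of capital becomes decisive under these terms. The contract size and rates below are hypothetical, chosen only to illustrate the mechanics:

```python
def upfront_capital(contract_value: float, upfront_frac: float,
                    cost_of_capital: float, years: float):
    """Cash due at signing, and what financing that cash costs over
    the contract term at a given annual cost of capital."""
    upfront = contract_value * upfront_frac
    financing_cost = upfront * ((1 + cost_of_capital) ** years - 1)
    return upfront, financing_cost

# Hypothetical: a 3-year, $300M GPU commitment with 25% paid upfront.
up, fin = upfront_capital(300e6, 0.25, 0.10, 3)
print(f"upfront: ${up/1e6:.0f}M, financing at 10%: ${fin/1e6:.1f}M")
# upfront: $75M, financing at 10%: $24.8M
# At a 5% cost of capital the same $75M costs only ~$11.8M to carry,
# so cheaper capital translates directly into cheaper compute.
```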
