Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

As enterprise spend on AI workflows explodes, companies will create custom evaluation benchmarks (evals) for each specific use case. These evals act as a system of record to hot-swap between different models based on price-performance, enabling perfect competition and ultimately commoditizing the API layer.

Related Insights

Faced with rising costs from proprietary labs, sophisticated enterprise clients are building internal evaluation and routing systems. This allows them to use cheaper, open-source models for less complex tasks, optimizing for both cost and performance.

Enterprises are currently overspending on tokens by sending all queries to the most powerful LLMs. A new software category will emerge to intelligently route requests to smaller, cheaper models when possible, creating a critical efficiency and cost-saving layer between companies and foundational model providers.

Standardized benchmarks for AI models are largely irrelevant for business applications. Companies need to create their own evaluation systems tailored to their specific industry, workflows, and use cases to accurately assess which new model provides a tangible benefit and ROI.

The assumption that enterprise API spending on AI models creates a strong moat is flawed. In reality, businesses can and will easily switch between providers like OpenAI, Google, and Anthropic. This makes the market a commodity battleground where cost and on-par performance, not loyalty, will determine the winners.

The common practice of model distillation suggests that AI capabilities will eventually be commoditized. As smaller models can cheaply mimic larger ones, differentiation will shift away from raw performance to product integration and price, likely triggering a massive price war among providers.

Enterprises will shift from relying on a single large language model to using orchestration platforms. These platforms will allow them to 'hot swap' various models—including smaller, specialized ones—for different tasks within a single system, optimizing for performance, cost, and use case without being locked into one provider.

As enterprises deploy agents for critical tasks like RFP generation or invoice processing, they will require dedicated evaluation frameworks and teams. This will create a massive new market for agent observability and eval tools, moving them beyond AI-native companies to the broader enterprise.

The AI value chain flows from hardware (NVIDIA) to apps, with LLM providers currently capturing most of the margin. The long-term viability of app-layer businesses depends on a competitive model layer. This competition drives down API costs, preventing model providers from having excessive pricing power and allowing apps to build sustainable businesses.

The rapid release of new AI models makes it crucial for companies to move beyond industry benchmarks. Developing internal evaluation systems ("evals") is necessary to test and determine which model performs best for unique, high-value business use cases, as model choice is becoming extremely important.

As foundational AI models become commoditized 'intelligence utilities,' the economic value moves up the stack. Orchestrators like OpenClaw, which can intelligently route tasks to the most efficient model based on cost or use case, are positioned to capture the margin that the underlying model providers cannot.

Enterprises Will Use Custom Evals to Commoditize the Foundation Model API Layer | RiffOn