OpenAI's GPT-5.5 costs more per token, but a new way of evaluating cost is emerging. The key metric is not the raw per-token price but how efficiently the model solves a problem. This 'intelligence per dollar' view reframes cost analysis around performance and compute: a more expensive model can be cheaper overall if it solves tasks more efficiently.
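One way to make 'intelligence per dollar' concrete is to count tasks solved per dollar of token spend. The sketch below is only an illustration under that assumption: the function name `tasks_per_dollar`, the prices, and the token counts are all hypothetical and do not come from any published pricing or benchmark.

```python
def tasks_per_dollar(tasks_solved: int, total_tokens: int, price_per_mtok: float) -> float:
    """Tasks solved per dollar of token spend.

    This is one possible reading of 'intelligence per dollar', not an
    established metric. price_per_mtok is dollars per million tokens.
    """
    dollars = total_tokens / 1_000_000 * price_per_mtok
    if dollars <= 0:
        raise ValueError("token spend must be positive")
    return tasks_solved / dollars


# Hypothetical numbers, for illustration only.
# The pricier model solves more tasks while consuming far fewer tokens.
pricey = tasks_per_dollar(tasks_solved=80, total_tokens=2_000_000, price_per_mtok=10.0)
cheap = tasks_per_dollar(tasks_solved=50, total_tokens=8_000_000, price_per_mtok=2.0)

print(f"pricey model: {pricey:.2f} tasks per dollar")
print(f"cheap model:  {cheap:.2f} tasks per dollar")
```

With these made-up inputs the model that is five times more expensive per token still solves more tasks per dollar, because it uses a quarter of the tokens and solves more of the tasks.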

Related Insights

The primary threat from competitors like Google may not be a superior model, but a more cost-efficient one. Google's Gemini 3 Flash offers "frontier-level intelligence" at a fraction of the cost. This shifts the competitive battleground from pure performance to price-performance, potentially undermining business models built on expensive, large-scale compute.

It's counterintuitive, but using a more expensive, more intelligent model like Opus 4.5 can be cheaper than using a smaller model. Because the smarter model needs fewer interactions to solve a problem, it ends up using fewer tokens overall, which offsets its higher per-token price.

When evaluating AI agents, the total cost of task completion is what matters. A model with a higher per-token cost can be more economical if it resolves a user's query in fewer turns than a cheaper, less capable model. This makes "number of turns" a primary efficiency metric.

The release of models like Sonnet 4.6 shows that the industry is moving beyond single 'state-of-the-art' benchmarks. The conversation now centers on practical, multi-factor evaluation: teams weigh a model's specific capabilities, cost, and context-window performance to determine its value for discrete tasks like agentic workflows, rather than judging raw intelligence alone.

Progress in complex, long-running agentic tasks is better measured by tokens consumed than by raw time. Improving token efficiency, as seen from GPT-5 to 5.1, directly enables more tool calls and actions within a feasible operational budget, unlocking greater capabilities.

Classifying a model as "reasoning" based on a chain-of-thought step is no longer useful. With massive differences in token efficiency, a so-called "reasoning" model can be faster and cheaper than a "non-reasoning" one for a given task. The focus is shifting to a continuous spectrum of capability versus overall cost.

There is a paradox: the cost of a fixed level of AI capability (e.g., GPT-4 level) has dropped 100-1000x, yet overall enterprise spend is increasing. Applications now use frontier models with massive contexts and multi-step agentic workflows, which multiply token usage and drive up total costs.

The binary distinction between "reasoning" and "non-reasoning" models is becoming obsolete. The more critical metric is now "token efficiency": a model's ability to spend more tokens only when a task's difficulty requires it. This dynamic token usage is a key differentiator for cost and performance.

In complex, multi-step tasks, overall cost is determined by tokens per turn and the total number of turns. A more intelligent, expensive model can be cheaper overall if it solves a problem in two turns, while a cheaper model might take ten turns, accumulating higher total costs. Future benchmarks must measure this turn efficiency.
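The turn arithmetic above can be sketched as cost = tokens per turn × number of turns × price per token. The example below assumes a constant token count per turn, which real agent loops rarely have (context usually grows as history accumulates). The `ModelProfile` class and every figure in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-task profile of a model; all values are illustrative."""
    name: str
    price_per_mtok: float  # dollars per million tokens
    tokens_per_turn: int   # assumed constant across turns for simplicity
    turns_to_solve: int

    def task_cost(self) -> float:
        """Total dollars to complete one task: tokens/turn * turns * price."""
        total_tokens = self.tokens_per_turn * self.turns_to_solve
        return total_tokens / 1_000_000 * self.price_per_mtok


# A pricier model that finishes in two turns vs. a cheaper one that needs ten.
smart = ModelProfile("smart", price_per_mtok=15.0, tokens_per_turn=4_000, turns_to_solve=2)
cheap = ModelProfile("cheap", price_per_mtok=3.0, tokens_per_turn=5_000, turns_to_solve=10)

for model in (smart, cheap):
    print(f"{model.name}: {model.turns_to_solve} turns, ${model.task_cost():.2f} per task")
```

In this made-up case the model with a five times higher per-token price is still cheaper per completed task. The result flips if the cheaper model needs only a few more turns, which is why the turn count has to be measured rather than assumed.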

While the cost for GPT-4 level intelligence has dropped over 100x, total enterprise AI spend is rising. This is driven by multipliers: using larger frontier models for harder tasks, reasoning-heavy workflows that consume more tokens, and complex, multi-turn agentic systems.
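A rough way to see how falling prices and rising spend can coexist is to treat the change in spend as the product of the usage multipliers divided by the price drop. This is a back-of-the-envelope sketch only: the function `spend_ratio` and every multiplier value below are assumptions chosen for illustration, not measured figures.

```python
import math


def spend_ratio(price_drop: float, multipliers: dict[str, float]) -> float:
    """Ratio of new spend to old spend for one workload.

    price_drop: how many times cheaper a fixed capability level has become.
    multipliers: factors that increase spend relative to that cheaper baseline.
    """
    if price_drop <= 0:
        raise ValueError("price_drop must be positive")
    return math.prod(multipliers.values()) / price_drop


# Hypothetical multipliers, for illustration only.
multipliers = {
    "frontier_model_premium": 20.0,  # frontier price vs. cheapest GPT-4-level price
    "reasoning_tokens": 10.0,        # extra tokens from reasoning-heavy workflows
    "agentic_turns": 10.0,           # extra calls from multi-turn agent loops
}

ratio = spend_ratio(price_drop=100.0, multipliers=multipliers)
print(f"spend changes by a factor of {ratio:.1f}")
```

Under these assumed values, a 100x price drop is more than cancelled by a combined 2,000x increase in what is being bought, so spend rises.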