Benchmarking reasoning models revealed no clear correlation between the level of reasoning and an LLM's performance. In fact, even when there is a slight accuracy gain (1-2%), it often comes with a significant cost increase, making it an inefficient trade-off.

Related Insights

LLMs shine when acting as a 'knowledge extruder'—shaping well-documented, 'in-distribution' concepts into specific code. They fail when the core task is novel problem-solving where deep thinking, not code generation, is the bottleneck. In these cases, the code is the easy part.

When evaluating AI agents, the total cost of task completion is what matters. A model with a higher per-token cost can be more economical if it resolves a user's query in fewer turns than a cheaper, less capable model. This makes "number of turns" a primary efficiency metric.

Models that generate "chain-of-thought" text before providing an answer are powerful but slow and computationally expensive. For tuned business workflows, the latency from waiting for these extra reasoning tokens is a major, often overlooked, drawback that impacts user experience and increases costs.

Progress in complex, long-running agentic tasks is better measured by tokens consumed rather than raw time. Improving token efficiency, as seen from GPT-5 to 5.1, directly enables more tool calls and actions within a feasible operational budget, unlocking greater capabilities.

Once a benchmark becomes a standard, research efforts naturally shift to optimizing for that specific metric. This can lead to models that excel on the test but don't necessarily improve in general, real-world capabilities—a classic example of Goodhart's Law in AI.

Classifying a model as "reasoning" based on a chain-of-thought step is no longer useful. With massive differences in token efficiency, a so-called "reasoning" model can be faster and cheaper than a "non-reasoning" one for a given task. The focus is shifting to a continuous spectrum of capability versus overall cost.

Simply having a large context window is insufficient. Models may fail to "see" or recall specific facts embedded deep within the context, a phenomenon exposed by "needle in the haystack" evaluations. Effective reasoning capability across the entire window is a separate, critical factor.

The binary distinction between "reasoning" and "non-reasoning" models is becoming obsolete. The more critical metric is now "token efficiency"—a model's ability to use more tokens only when a task's difficulty requires it. This dynamic token usage is a key differentiator for cost and performance.

In complex, multi-step tasks, overall cost is determined by tokens per turn and the total number of turns. A more intelligent, expensive model can be cheaper overall if it solves a problem in two turns, while a cheaper model might take ten turns, accumulating higher total costs. Future benchmarks must measure this turn efficiency.

To improve LLM reasoning, researchers feed them data that inherently contains structured logic. Training on computer code was an early breakthrough, as it teaches patterns of reasoning far beyond coding itself. Textbooks are another key source for building smaller, effective models.