Benchmark data shows an MoE model's performance correlates more strongly with its total parameter count than with its active parameter count. With models like Kimi K2 activating only about 3% of their parameters per token, this suggests there is still significant room to push sparsity and efficiency further.
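As a rough illustration, here is the arithmetic behind that 3% figure, using Kimi K2's publicly reported configuration (about 1T total parameters with roughly 32B active per token; treat these numbers as approximate):

```python
# Rough sparsity arithmetic; Kimi K2's publicly reported figures are roughly
# 1T total parameters with ~32B active per token (treat as approximate).
total_params = 1_000e9    # ~1 trillion total parameters
active_params = 32e9      # ~32B parameters activated per token

print(f"Active fraction: {active_params / total_params:.1%}")                    # ~3.2%
print(f"Per-token FLOP savings vs. a dense model of the same size: ~{total_params / active_params:.0f}x")
```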
A 10x increase in compute may only yield a one-tier improvement in model performance. This appears inefficient but can be the difference between a useless "6-year-old" intelligence and a highly valuable "16-year-old" intelligence, unlocking entirely new economic applications.
The "Bitter Lesson" is not just about using more compute, but leveraging it scalably. Current LLMs are inefficient because they only learn during a discrete training phase, not during deployment where most computation occurs. This reliance on a special, data-intensive training period is not a scalable use of computational resources.
The "Omniscience" accuracy benchmark, which measures pure factual knowledge, tracks more closely with a model's total parameters than any other metric. This suggests embedded knowledge is a direct function of model size, distinct from reasoning abilities developed via training techniques.
Classifying a model as "reasoning" simply because it emits a chain-of-thought step is no longer useful. With massive differences in token efficiency, a so-called "reasoning" model can be faster and cheaper than a "non-reasoning" one on a given task. The focus is shifting to a continuous spectrum of capability versus overall cost.
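A quick, hypothetical cost comparison makes the point; the prices and token counts below are placeholders, not real model pricing:

```python
# Hypothetical per-task cost: output tokens times per-token price.
def task_cost(output_tokens: int, price_per_m_tokens: float) -> float:
    return output_tokens / 1e6 * price_per_m_tokens

# A token-efficient "reasoning" model can undercut a verbose "non-reasoning"
# model even at a higher per-token price.
reasoning     = task_cost(output_tokens=800,  price_per_m_tokens=10.0)
non_reasoning = task_cost(output_tokens=6000, price_per_m_tokens=2.0)

print(f"reasoning model:     ${reasoning:.4f} per task")      # $0.0080
print(f"non-reasoning model: ${non_reasoning:.4f} per task")  # $0.0120
```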
Analysis of model performance reveals a distinct shift with GPT-4 and subsequent models. These newer models are far more correlated with one another in which tasks they succeed or fail on than models from the pre-GPT-4 era were. This could suggest a convergence in training data, architectures, or agent scaffolding methodologies across different labs.
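One way to quantify that convergence, sketched here with a made-up pass/fail matrix rather than real results, is to correlate each pair of models' per-task outcomes:

```python
import numpy as np

# Each row is a model, each column a task; entries are 1 (solved) or 0 (failed).
# The matrix is fabricated purely for illustration of the method.
results = np.array([
    [1, 1, 0, 1, 0, 1, 1, 0],   # model A
    [1, 1, 0, 1, 0, 1, 0, 0],   # model B (profile similar to A)
    [0, 1, 1, 0, 1, 0, 1, 1],   # model C (different profile)
])

# Pairwise Pearson correlation of pass/fail patterns across tasks.
corr = np.corrcoef(results)
print(np.round(corr, 2))
```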
Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects its target hardware and then chooses model parameters, such as a hidden dimension divisible by large powers of two, to align with how GPUs split up workloads, maximizing efficiency from day one.
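Zyphra's exact procedure isn't spelled out here, but the general idea can be sketched as a divisibility check against assumed GPU tile sizes and tensor-parallel degrees:

```python
# Illustrative check of how "GPU-friendly" a hidden dimension is: count its
# factors of two and test whether it splits evenly across assumed tile sizes
# and tensor-parallel degrees. The specific sizes are assumptions, not Zyphra's.
def gpu_friendliness(hidden_dim: int, tile_sizes=(64, 128, 256), tp_degrees=(2, 4, 8)):
    powers_of_two, d = 0, hidden_dim
    while d % 2 == 0:
        d //= 2
        powers_of_two += 1
    return {
        "powers_of_two": powers_of_two,
        "divisible_by_tiles": {t: hidden_dim % t == 0 for t in tile_sizes},
        "divisible_by_tp": {tp: hidden_dim % tp == 0 for tp in tp_degrees},
    }

print(gpu_friendliness(6144))   # 2^11 * 3: splits cleanly across common tiles
print(gpu_friendliness(6000))   # only 2^4 * 375: awkward to tile and shard
```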
Companies like OpenAI and Anthropic are intentionally shrinking their flagship models (e.g., GPT-4o is smaller than GPT-4). The biggest constraint isn't creating more powerful models, but serving them at a speed users will tolerate. Slow models kill adoption, regardless of their intelligence.
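A rough back-of-envelope shows why size caps serving speed, assuming decoding is memory-bandwidth bound and all active weights are read once per generated token (the hardware and precision figures below are illustrative, not measurements):

```python
# Upper bound on single-user decode speed under a memory-bandwidth-bound model:
# tokens/sec ~ HBM bandwidth / bytes of active weights read per token.
def max_tokens_per_sec(active_params: float, bytes_per_param: float, hbm_bandwidth_gbs: float) -> float:
    bytes_per_token = active_params * bytes_per_param
    return hbm_bandwidth_gbs * 1e9 / bytes_per_token

# Same assumed hardware (~3,350 GB/s of HBM bandwidth), fp16 weights (2 bytes each):
for params in (70e9, 400e9, 1500e9):
    print(f"{params/1e9:>6.0f}B active params -> ~{max_tokens_per_sec(params, 2, 3350):.1f} tok/s per user")
```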
Chinese AI models like Kimi achieve dramatic cost reductions through specific architectural choices, not just scale. Using a "mixture of experts" design, they activate only a fraction of their total parameters for any given token, making them far more efficient to run than the "dense" models common in the West.
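A toy sketch of the routing idea: a router scores every expert, but only the top-k actually run for each token, so per-token compute scales with k rather than with the total expert count (all shapes and counts below are arbitrary):

```python
import numpy as np

# Minimal mixture-of-experts routing sketch with toy dimensions.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # one weight matrix per expert

def moe_forward(x):                       # x: (d_model,) single token
    logits = x @ router_w                 # score every expert
    chosen = np.argsort(logits)[-top_k:]  # keep only the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only k of the n_experts weight matrices are ever multiplied for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

out = moe_forward(rng.normal(size=d_model))
print(out.shape, f"-> used {top_k}/{n_experts} experts ({top_k/n_experts:.0%} of expert params)")
```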
Artificial Analysis found that a model's ability to recall facts is a strong function of its total size, even for sparse Mixture-of-Experts (MoE) models. This suggests that the vast pool of "inactive" parameters in MoE architectures contributes significantly to the model's overall knowledge base, not just the subset of parameters active for each token.
The binary distinction between "reasoning" and "non-reasoning" models is becoming obsolete. The more critical metric is now "token efficiency"—a model's ability to use more tokens only when a task's difficulty requires it. This dynamic token usage is a key differentiator for cost and performance.