Performance on knowledge-intensive benchmarks correlates strongly with an MoE model's total parameter count, not its active parameter count. With a leading model like Kimi K2 reportedly activating only ~3% of its parameters per token, this suggests there is significant room to increase sparsity and efficiency without degrading factual recall.
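
As a rough illustration of what that sparsity figure means, the sketch below computes the active fraction from total and per-token active parameter counts; the Kimi K2 numbers are approximate, publicly reported figures used only to show the arithmetic.

```python
# Rough arithmetic behind an "~3% active" figure for a sparse MoE model.
# The parameter counts are approximate, publicly reported figures and are
# used here only to illustrate the calculation.

def active_fraction(active_params: float, total_params: float) -> float:
    """Fraction of the model's parameters used to process any single token."""
    return active_params / total_params

# Kimi K2 is reported at roughly 1T total parameters with ~32B active per token.
print(f"{active_fraction(active_params=32e9, total_params=1e12):.1%}")  # ~3.2%
```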

Related Insights

The "Bitter Lesson" is not just about using more compute, but leveraging it scalably. Current LLMs are inefficient because they only learn during a discrete training phase, not during deployment where most computation occurs. This reliance on a special, data-intensive training period is not a scalable use of computational resources.

LLMs learn two things from pre-training: factual knowledge and intelligent algorithms (the "cognitive core"). Karpathy argues the vast memorized knowledge is a hindrance, making models rely on memory instead of reasoning. The goal should be to strip away this knowledge to create a pure, problem-solving cognitive entity.

The "Omniscience" accuracy benchmark, which measures pure factual knowledge, tracks more closely with a model's total parameters than any other metric. This suggests embedded knowledge is a direct function of model size, distinct from reasoning abilities developed via training techniques.

Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects target hardware and then chooses model parameters, such as a hidden dimension divisible by large powers of two, to align with how GPUs split up workloads, maximizing efficiency from day one.
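
A minimal sketch of what such a check might look like: prefer hidden dimensions that carry a large power-of-two factor and divide evenly by a few hardware-related sizes. The divisors used below (tensor-core tile width, tensor-parallel shard count, head size) are illustrative assumptions, not Zyphra's actual constraints.

```python
# Sketch: prefer hidden dimensions that divide cleanly by GPU-friendly factors.
# The divisors (tile width, shard count, head size) are illustrative assumptions,
# not any specific vendor's requirements.

def power_of_two_factor(n: int) -> int:
    """Largest power of two dividing n."""
    return n & -n

def gpu_friendly(hidden_dim: int, divisors=(128, 64, 8)) -> bool:
    """True if hidden_dim splits evenly across all assumed hardware divisors."""
    return all(hidden_dim % d == 0 for d in divisors)

for dim in (4096, 5120, 6144, 7000):
    print(dim, power_of_two_factor(dim), gpu_friendly(dim))
# 4096 and 6144 carry large power-of-two factors; 7000 fails the divisibility
# checks, so it would map poorly onto the assumed tile and shard sizes.
```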

Chinese AI models like Kimi achieve dramatic cost reductions through specific architectural choices, not just scale. Using a "mixture of experts" design, they activate only a fraction of their total parameters for any given token, making them far more efficient to run than the "dense" models common in the West.
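
A minimal sketch of the mixture-of-experts idea, using a plain top-k softmax router; the expert count, top-k, and dimensions are chosen for illustration and are not Kimi's actual configuration.

```python
# Minimal sketch of top-k mixture-of-experts routing with NumPy.
# Expert count, top_k, and dimensions are illustrative, not Kimi's real config.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))            # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a token to its top-k experts; the other experts stay idle."""
    logits = x @ router_w                                    # (n_experts,)
    top = np.argsort(logits)[-top_k:]                        # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    # Only top_k of the n_experts weight matrices are used: the "active" parameters.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)   # (16,), computed with 2 of 8 experts
```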

Artificial Analysis found that a model's ability to recall facts is a strong function of its total size, even for sparse Mixture-of-Experts (MoE) models. This suggests that the many parameters left "inactive" on any given token still contribute significantly to the model's overall knowledge base, not just the ones activated per token.

The binary distinction between "reasoning" and "non-reasoning" models is becoming obsolete. The more critical metric is now "token efficiency"—a model's ability to use more tokens only when a task's difficulty requires it. This dynamic token usage is a key differentiator for cost and performance.
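
A toy way to make token efficiency concrete: compare the total spend of a policy that scales output tokens with task difficulty against one that always reasons at full length. All numbers (token counts, price, difficulty levels) are made-up assumptions for illustration.

```python
# Toy illustration of "token efficiency": total cost when output tokens scale
# with task difficulty versus a fixed long reasoning budget.
# All figures below are invented for illustration only.

PRICE_PER_1K_OUTPUT_TOKENS = 0.01  # hypothetical price, USD

tasks = [("easy", 1), ("easy", 1), ("medium", 3), ("hard", 10)]  # relative difficulty

def adaptive_tokens(difficulty: int) -> int:
    return 200 * difficulty          # spend more tokens only when needed

def fixed_tokens(_difficulty: int) -> int:
    return 2000                      # always reason at full length

for policy in (adaptive_tokens, fixed_tokens):
    total = sum(policy(d) for _, d in tasks)
    cost = total / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    print(f"{policy.__name__}: {total} tokens, ${cost:.3f}")
# Assuming both policies reach the same accuracy, the adaptive one spends
# 3,000 tokens versus 8,000 for the fixed budget.
```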

Overloading LLMs with excessive context degrades performance, a phenomenon known as 'context rot'. Claude Skills address this by loading context only when relevant to a specific task. This laser-focused approach improves accuracy and avoids the performance degradation seen in broader project-level contexts.
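
A generic sketch of the load-only-when-relevant pattern, not Claude's actual Skills mechanism: detailed instructions stay out of the prompt until a simple relevance check matches the task. The skills dictionary, trigger keywords, and prompt assembly below are illustrative assumptions.

```python
# Generic sketch of loading context only when it is relevant to the task.
# This is NOT Claude's actual Skills implementation; the skill entries,
# trigger keywords, and prompt assembly are illustrative assumptions.

skills = {
    "pdf-extraction": {
        "triggers": ["pdf", "extract"],
        "instructions": "Detailed steps for pulling tables out of PDFs.",
    },
    "spreadsheet-cleanup": {
        "triggers": ["csv", "spreadsheet"],
        "instructions": "Detailed steps for normalizing spreadsheet data.",
    },
}

def build_prompt(task: str) -> str:
    """Include only the skill instructions whose triggers match the task."""
    relevant = [
        skill["instructions"]
        for skill in skills.values()
        if any(trigger in task.lower() for trigger in skill["triggers"])
    ]
    return "\n\n".join(relevant + [task])

print(build_prompt("Extract the revenue table from this PDF"))
# Only the pdf-extraction instructions enter the context; the rest stay out.
```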

Artificial Analysis found its knowledge-based "Omniscience" accuracy benchmark tracks closely with an LLM's total parameter count. By plotting open-weight models on this curve, they can reasonably estimate the size of closed models, suggesting leading frontier models are in the 5-10 trillion parameter range.
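
A hedged sketch of that estimation approach: fit benchmark accuracy against log total parameters for open-weight models, then invert the fit for a closed model's score. The accuracy values and parameter counts below are placeholders, not Artificial Analysis's actual measurements.

```python
# Sketch: fit knowledge-benchmark accuracy against log10(total parameters) for
# open-weight models, then invert the fit to estimate a closed model's size.
# All accuracies and parameter counts are placeholders, not real data.
import numpy as np

open_models = {        # total params -> benchmark accuracy (illustrative)
    70e9:  0.38,
    235e9: 0.46,
    405e9: 0.50,
    1e12:  0.56,
}
log_params = np.log10(list(open_models.keys()))
accuracy = np.array(list(open_models.values()))

slope, intercept = np.polyfit(log_params, accuracy, 1)   # linear fit in log-space

def estimate_total_params(closed_model_accuracy: float) -> float:
    """Invert the fitted line to get an estimated total parameter count."""
    return 10 ** ((closed_model_accuracy - intercept) / slope)

print(f"{estimate_total_params(0.68):.2e} parameters")   # a rough trillions-scale estimate
```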
