The public-facing models from major labs are likely efficient Mixture-of-Experts (MoE) versions distilled from much larger, private, and computationally expensive dense models. This means the model users interact with is a smaller, optimized copy, not the original frontier model.
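To make the distillation idea concrete, here is a minimal sketch of the standard soft-label approach (Hinton-style knowledge distillation), assuming a frozen teacher and a smaller trainable student; the names and the temperature value are illustrative, not details from any lab's actual pipeline:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label knowledge distillation: push the student's softened
    output distribution toward the teacher's (Hinton et al., 2015)."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * (t * t)

# Hypothetical training step (teacher frozen, student trainable):
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   loss = distillation_loss(student(batch), teacher_logits)
#   loss.backward()
```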
Quantization and distillation don't simply create a smaller version of an LLM. These optimization processes alter the model's behavior to the point where it becomes a new entity, a "cousin." It may remain coherent and functional, but it will not produce the same outputs as the original.
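A toy illustration of why quantization yields a "cousin" rather than a copy: rounding weights to int8 perturbs every output value, and in a full LLM those perturbations compound across layers until sampled tokens can diverge. The matrix sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)   # toy weight matrix
x = rng.normal(size=16).astype(np.float32)        # toy input

# Symmetric int8 quantization: scale into [-127, 127], round, dequantize.
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)
W_dq = W_q.astype(np.float32) * scale

logits_fp32 = W @ x
logits_int8 = W_dq @ x

print("max |delta logit|:", np.abs(logits_fp32 - logits_int8).max())
print("argmax flipped:", logits_fp32.argmax() != logits_int8.argmax())
```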
China is gaining an efficiency edge in AI by using "distillation"—training smaller, cheaper models from larger ones. This "train the trainer" approach is much faster and challenges the capital-intensive US strategy, highlighting how inefficient and "bloated" current Western foundational models are.
The model uses a Mixture-of-Experts (MoE) architecture with over 200 billion parameters, but activates only a sparse subset of roughly 10 billion for any given task. This design provides the knowledge base of a massive model while keeping inference speed and cost comparable to those of much smaller models.
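A minimal sketch of how that sparse activation works, assuming a standard top-k routed MoE layer; the dimensions and expert count here are toy values, not the architecture described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Sparse MoE layer: a learned router sends each token to its
    top-k experts, so only a small fraction of the layer's total
    parameters are used on any forward pass."""
    def __init__(self, d_model=64, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                      # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        topk_w, topk_i = gate.topk(self.k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize
        out = torch.zeros_like(x)
        # Naive per-token loop for clarity; real systems batch by expert.
        for t in range(x.size(0)):
            for w, i in zip(topk_w[t], topk_i[t]):
                out[t] += w * self.experts[int(i)](x[t])
        return out

layer = ToyMoELayer()
y = layer(torch.randn(4, 64))  # each token touches 2 of 16 experts
```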
The gap between benchmark scores and real-world performance suggests labs achieve high scores by distilling superior models or training for specific evals. This makes benchmarks a poor proxy for genuine capability, a skepticism that should be applied to all new model releases.
Performance on knowledge-intensive benchmarks correlates strongly with an MoE model's total parameter count, not its active parameter count. With leading models like Kimi K2 reportedly using only ~3% active parameters, this suggests there is significant room to increase sparsity and efficiency without degrading factual recall.
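The ~3% figure is simple arithmetic given the approximate numbers that have circulated in public reporting about Kimi K2 (treat both as assumptions):

```python
# Approximate figures for Kimi K2 from public reporting (assumption):
total_params  = 1.0e12   # ~1 trillion parameters in total
active_params = 32e9     # ~32 billion activated per token

print(f"active fraction: {active_params / total_params:.1%}")  # 3.2%
```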
Companies like OpenAI and Anthropic are intentionally shrinking their flagship models (e.g., GPT-4o is smaller than GPT-4). The biggest constraint isn't creating more powerful models, but serving them at a speed users will tolerate. Slow models kill adoption, regardless of their intelligence.
Chinese AI models like Kimi achieve dramatic cost reductions through specific architectural choices, not just scale. Using a "mixture of experts" design, they only utilize a fraction of their total parameters for any given task, making them far more efficient to run than the "dense" models common in the West.
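A back-of-envelope comparison of why this matters for serving cost, using the common approximation of ~2 FLOPs per active parameter per generated token (this ignores attention FLOPs and memory bandwidth, which also matter in practice); the parameter counts reuse the illustrative figures above:

```python
# ~2 FLOPs per active parameter per token is a standard rough estimate
# for a transformer forward pass (matrix multiplies dominate).
def flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

dense_cost = flops_per_token(200e9)  # hypothetical 200B dense model
moe_cost   = flops_per_token(10e9)   # MoE with ~10B active parameters

print(f"dense: {dense_cost:.1e} FLOPs/token")
print(f"MoE:   {moe_cost:.1e} FLOPs/token")
print(f"ratio: {dense_cost / moe_cost:.0f}x cheaper per token")
```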
Artificial Analysis found that a model's ability to recall facts is a strong function of its total size, even for sparse Mixture-of-Experts (MoE) models. This suggests that the vast pool of parameters left "inactive" on any given token still contributes significantly to the model's overall knowledge base, not just the small set that is active.
A fundamental constraint today is that the model architecture used for training must be the same as the one used for inference. Future breakthroughs could come from lifting this constraint. This would allow for specialized models: one optimized for compute-intensive training and another for memory-intensive serving.
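A toy demonstration of the constraint as it exists today, assuming a PyTorch-style workflow: trained weights only load into the exact architecture that produced them, so any train/serve split currently has to go through an indirect mapping such as distillation:

```python
import torch.nn as nn

# Weights trained in one architecture...
train_model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(),
                            nn.Linear(2048, 512))
state = train_model.state_dict()

# ...cannot be loaded into a differently shaped serving architecture.
serve_model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(),
                            nn.Linear(1024, 512))
try:
    serve_model.load_state_dict(state)
except RuntimeError as err:
    print("mismatch:", str(err).splitlines()[0])

# Lifting the constraint would mean a principled mapping from the
# training architecture to the serving one; today the closest proxy
# is indirect, e.g. distilling the trained model into the server.
```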