Artificial Analysis found that a model's ability to recall facts is a strong function of its total size, even for sparse Mixture-of-Experts (MoE) models. This suggests that the vast number of "inactive" parameters in MoE architectures contribute significantly to the model's overall knowledge base, not just the active ones per token.

Related Insights

A core debate in AI is whether LLMs, which are text prediction engines, can achieve true intelligence. Critics argue they cannot because they lack a model of the real world. This prevents them from making meaningful, context-aware predictions about future events—a limitation that more data alone may not solve.

LLMs learn two things from pre-training: factual knowledge and intelligent algorithms (the "cognitive core"). Karpathy argues the vast memorized knowledge is a hindrance, making models rely on memory instead of reasoning. The goal should be to strip away this knowledge to create a pure, problem-solving cognitive entity.

When AI pioneers like Geoffrey Hinton see agency in an LLM, they are misinterpreting the output. What they are actually witnessing is a compressed, probabilistic reflection of the immense creativity and knowledge from all the humans who created its training data. It's an echo, not a mind.

Karpathy identifies a key missing piece for continual learning in AI: an equivalent to sleep. Humans seem to use sleep to distill the day's experiences (their "context window") into the compressed weights of the brain. LLMs lack this distillation phase, forcing them to restart from a fixed state in every new session.

The "bitter lesson" in AI research posits that methods leveraging massive computation scale better and ultimately win out over approaches that rely on human-designed domain knowledge or clever shortcuts, favoring scale over ingenuity.

Classifying a model as "reasoning" based on a chain-of-thought step is no longer useful. With massive differences in token efficiency, a so-called "reasoning" model can be faster and cheaper than a "non-reasoning" one for a given task. The focus is shifting to a continuous spectrum of capability versus overall cost.

Google's Titans architecture for LLMs mimics human memory by applying Claude Shannon's information theory. It scans vast data streams and identifies "surprise"—statistically unexpected or rare information relative to its training data. This novel data is then prioritized for long-term memory, preventing clutter from irrelevant information.

Chinese AI models like Kimi achieve dramatic cost reductions through specific architectural choices, not just scale. Using a "mixture of experts" design, they only utilize a fraction of their total parameters for any given task, making them far more efficient to run than the "dense" models common in the West.

Unlike humans, whose poor memory forces them to generalize and find patterns, LLMs are incredibly good at memorization. Karpathy argues this is a flaw. It distracts them with recalling specific training documents instead of focusing on the underlying, generalizable algorithms of thought, hindering true understanding.

To improve LLM reasoning, researchers feed them data that inherently contains structured logic. Training on computer code was an early breakthrough, as it teaches patterns of reasoning far beyond coding itself. Textbooks are another key source for building smaller, effective models.