Attempting to interpret every learned circuit in a complex neural network is a futile effort. True understanding comes from describing the system's foundational elements: its architecture, learning rule, loss functions, and the data it was trained on. The emergent complexity is a result of this process.
Reinforcement learning incentivizes AIs to find the right answer, not just mimic human text. This pushes models to develop their own internal "dialect" for reasoning: a chain of thought that is effective but increasingly alien and incomprehensible to human observers.
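A minimal sketch of that incentive structure, under entirely toy assumptions (the vocabulary, chain length, and reward function below are hypothetical): a REINFORCE-style loop scores only the final answer, so nothing in the objective ever rewards the intermediate tokens for staying human-readable.

```python
# Minimal sketch, toy setup (hypothetical vocabulary, chain length, and reward):
# only the final answer is rewarded, so whatever intermediate "dialect" happens
# to produce it gets reinforced.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, CHAIN_LEN, TARGET = 8, 4, 1
logits = np.zeros((CHAIN_LEN, VOCAB))      # per-position token policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_from_chain(chain):
    # Toy stand-in for "decoding a final answer from the chain of thought".
    return sum(chain) % 2

for step in range(2000):
    probs = [softmax(logits[i]) for i in range(CHAIN_LEN)]
    chain = [rng.choice(VOCAB, p=p) for p in probs]
    reward = 1.0 if answer_from_chain(chain) == TARGET else 0.0
    # REINFORCE update: make the sampled tokens more likely in proportion to
    # reward. The chain itself is never graded for readability.
    for i, tok in enumerate(chain):
        grad = -probs[i]
        grad[tok] += 1.0
        logits[i] += 0.1 * (reward - 0.5) * grad
```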
The progression from early neural networks to today's massive models has been driven fundamentally by the exponential growth in available computational power, from the initial move to GPUs to the roughly million-fold increase in compute now applied to training a single model.
Contrary to the view that in-context learning is a distinct process from training, Karpathy speculates it might be an emergent form of gradient descent happening within the model's layers. He cites papers showing that transformers can learn to perform linear regression in-context, with internal mechanics that mimic an optimization loop.
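A minimal sketch of the experimental setup behind that speculation (assumed here from the papers' public descriptions, not their code): each prompt is a fresh linear-regression task, and the hypothesis is that a trained transformer's forward pass effectively runs an optimization loop like the explicit one below.

```python
# Sketch of in-context linear regression (task construction assumed from the
# papers' setup; the gradient loop is the hypothesized internal mechanism,
# written out explicitly rather than hidden in transformer layers).
import numpy as np

rng = np.random.default_rng(0)
d, n_context = 5, 20

w_true = rng.normal(size=d)                  # a new task per prompt
X = rng.normal(size=(n_context, d))          # in-context examples
y = X @ w_true
x_query = rng.normal(size=d)

# "Learning in the forward pass": gradient descent on the context examples
# only, with no persistent weights ever updated.
w = np.zeros(d)
for _ in range(500):
    w -= 0.1 * X.T @ (X @ w - y) / n_context

print("in-context GD prediction:", x_query @ w)
print("ground-truth value:      ", x_query @ w_true)
```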
The ambition to fully reverse-engineer AI models into simple, understandable components is proving unrealistic because their internal workings are messy and complex. Interpretability's practical value is less about achieving guarantees and more about coarse-grained analysis, such as identifying when specific high-level capabilities are being used.
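One concrete form of that coarse-grained analysis is a linear probe. The sketch below uses synthetic "activations" (the layer width, feature direction, and labels are all made up) to show the idea: detecting that a high-level feature is active, without reverse-engineering the circuit that computes it.

```python
# Minimal probe sketch on synthetic activations (d_model, the feature direction,
# and the labels are hypothetical, not taken from any real model).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 64, 2000

feature_dir = rng.normal(size=d_model)            # pretend capability direction
labels = rng.integers(0, 2, size=n)               # capability on/off per example
acts = rng.normal(size=(n, d_model)) + np.outer(labels, feature_dir)

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(acts[1500:], labels[1500:]))
```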
Current AI can learn to predict complex patterns, like planetary orbits, from data. However, it struggles to abstract the underlying causal laws, such as Newton's second law (F = ma). This leap to a higher level of abstraction remains a fundamental challenge beyond simple pattern recognition.
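A small sketch of that gap (the simulation constants and the ridge-regression "learner" are both toy assumptions): a curve-fitting model predicts the simulated orbit's next state accurately, yet its parameters are just a matrix of numbers; extracting the inverse-square law and F = ma from them is the abstraction step that remains hard.

```python
# Toy sketch (illustrative constants and regressor): the fitted model predicts
# the orbit it was trained on, but nothing in its 4x4 weight matrix resembles
# Newton's laws.
import numpy as np

# Ground truth: 2D orbit under an inverse-square law, integrated with Euler steps.
GM, dt = 1.0, 0.01
state = np.array([1.0, 0.0, 0.0, 1.0])            # x, y, vx, vy
traj = []
for _ in range(5000):
    x, y, vx, vy = state
    r3 = (x * x + y * y) ** 1.5
    ax, ay = -GM * x / r3, -GM * y / r3
    state = np.array([x + vx * dt, y + vy * dt, vx + ax * dt, vy + ay * dt])
    traj.append(state)
traj = np.array(traj)

# "Pattern learner": ridge regression mapping the current state to the next one.
X, Y = traj[:-1], traj[1:]
W = np.linalg.solve(X.T @ X + 1e-6 * np.eye(4), X.T @ Y)

print("one-step prediction error:", np.linalg.norm(X[1000] @ W - Y[1000]))
# W captures the pattern; recovering the law (force ~ 1/r^2, a = F/m) from W
# is the higher level of abstraction the passage describes.
```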
As AI models are used for critical decisions in finance and law, black-box empirical testing will become insufficient. Mechanistic interpretability, which analyzes model weights to understand reasoning, is a bet that society and regulators will require explainable AI, making it a crucial future technology.
For AI systems to be adopted in scientific labs, they must be interpretable. Researchers need to understand the 'why' behind an AI's experimental plan to validate and trust the process, making interpretability a more critical feature than raw predictive power.
Demanding interpretability from AI trading models is a fallacy because they operate at a superhuman level. An AI predicting a stock's price in one minute is processing data in a way no human can. Expecting a simple, human-like explanation for its decision is unreasonable, much like asking a chess engine to explain its moves in prose.
Unlike traditional software, large language models are not programmed with specific instructions. They evolve through a process where different strategies are tried, and those that receive positive rewards are repeated, making their behaviors emergent and sometimes unpredictable.
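As a contrast with explicitly programmed behavior, here is a minimal sketch (a toy bandit with invented payoffs): no line of the program states which choice is correct; the preference emerges only because some tried strategies happened to be rewarded.

```python
# Toy sketch (hypothetical payoffs): behavior is shaped purely by trial and
# reward, with no explicit instruction about which arm is "right".
import numpy as np

rng = np.random.default_rng(0)
true_payoffs = [0.2, 0.5, 0.8]                 # unknown to the learner
estimates, counts = np.zeros(3), np.zeros(3)

for step in range(5000):
    # Epsilon-greedy: mostly repeat what has worked, occasionally explore.
    arm = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(estimates))
    reward = float(rng.random() < true_payoffs[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print("learned preferences:", estimates)       # the best arm dominates, unprogrammed
```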
AI models use simple, mathematically clean loss functions. The human brain's superior learning efficiency might stem from evolution hard-coding numerous, complex, and context-specific loss functions that activate at different developmental stages, creating a sophisticated learning curriculum.
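A minimal sketch of what such a curriculum of losses might look like (the individual loss terms and the stage schedule are entirely hypothetical, chosen only to illustrate the idea): the effective objective is a mixture of context-specific terms whose weights shift across developmental stages.

```python
# Hypothetical curriculum of losses (all terms and the schedule are invented
# for illustration): different objectives dominate at different stages.
import numpy as np

def imitation_loss(pred, target):        # e.g., copy observed behavior
    return float(np.mean((pred - target) ** 2))

def curiosity_loss(pred, prev_pred):     # e.g., favor novelty / prediction change
    return -float(np.mean(np.abs(pred - prev_pred)))

def social_loss(pred, peer_pred):        # e.g., match peers
    return float(np.mean((pred - peer_pred) ** 2))

def stage_weights(step, total_steps):
    """Curiosity dominates early, imitation in the middle, social terms late."""
    t = step / total_steps
    return {"curiosity": max(0.0, 1.0 - 2.0 * t),
            "imitation": 1.0 - abs(2.0 * t - 1.0),
            "social":    max(0.0, 2.0 * t - 1.0)}

def total_loss(step, total_steps, pred, target, prev_pred, peer_pred):
    w = stage_weights(step, total_steps)
    return (w["curiosity"] * curiosity_loss(pred, prev_pred)
            + w["imitation"] * imitation_loss(pred, target)
            + w["social"] * social_loss(pred, peer_pred))
```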