Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

The entire deep learning paradigm, including backpropagation, can be viewed as a form of in-context learning. This reframes the pre-training phase not as a separate process, but as the model forming a long-term associative memory, unifying it with inference-time adaptation.

Related Insights

Dario Amodei suggests that the massive data requirement for AI pre-training is not a flaw but a different paradigm. It is analogous to the long process of human evolution setting up our brain's priors, not just an individual's lifetime of learning, which explains its sample inefficiency.

Karpathy identifies a key missing piece for continual learning in AI: an equivalent to sleep. Humans seem to use sleep to distill the day's experiences (their "context window") into the compressed weights of the brain. LLMs lack this distillation phase, forcing them to restart from a fixed state in every new session.

When an LLM is shown few-shot examples of a new task, it is performing Bayesian updating. With each example provided in the prompt, its belief (posterior probability) about the correct next token shifts, allowing it to "learn" a new pattern on the fly without changing its weights.

Attempting to interpret every learned circuit in a complex neural network is a futile effort. True understanding comes from describing the system's foundational elements: its architecture, learning rule, loss functions, and the data it was trained on. The emergent complexity is a result of this process.

Contrary to the view that in-context learning is a distinct process from training, Karpathy speculates it might be an emergent form of gradient descent happening within the model's layers. He cites papers showing that transformers can learn to perform linear regression in-context, with internal mechanics that mimic an optimization loop.

Research suggests a formal equivalence between modifying a model's internal activations (steering) and providing prompt examples (in-context learning). This framework could potentially create a formula to convert between the two techniques, even for complex behaviors like jailbreaks.

The distinction between a model's architecture and its optimizer is an illusion. Both are learning processes compressing a flow of context—the architecture compresses tokens, while the optimizer compresses gradients. This unified view allows for designing them as one interconnected system.

Contrary to the belief that memorization requires multiple training epochs, large language models demonstrate the capacity to perfectly recall specific information after seeing it only once. This surprising phenomenon highlights how understudied the information theory behind LLMs still is.

Using a sparse autoencoder to identify active concepts, one can project a model's gradient update onto these concepts. This reveals what the model is learning (e.g., "pirate speak" vs. "arithmetic") and allows for selectively amplifying or suppressing specific learning directions.

The key to continual learning is not just a longer context window, but a new architecture with a spectrum of memory types. "Nested learning" proposes a model with different layers that update at different frequencies—from transient working memory to persistent core knowledge—mimicking how humans learn without catastrophic forgetting.