Backpropagation Is a Form of In-Context Learning, Reframing Pre-Training as Associative Memory

Related Insights

Anthropic CEO: AI Pre-training Mirrors Human Evolution More Than Individual Learning

Dario Amodei suggests that the massive data requirement of AI pre-training is not a flaw but a different paradigm. Pre-training is analogous to the long evolutionary process that set up the brain's priors, rather than to one individual's lifetime of learning, which explains its sample inefficiency.

Dario Amodei — "We are near the end of the exponential"

Dwarkesh Podcast·5 months ago

LLMs Lack a "Sleep" Phase to Distill Daily Experiences into Long-Term Memory

Karpathy identifies a key missing piece for continual learning in AI: an equivalent to sleep. Humans seem to use sleep to distill the day's experiences (their "context window") into the compressed weights of the brain. LLMs lack this distillation phase, forcing them to restart from a fixed state in every new session.

Andrej Karpathy — AGI is still a decade away

Dwarkesh Podcast·9 months ago

In-Context Learning Is Simply Real-Time Bayesian Updating Based on Prompt Evidence

When an LLM is shown few-shot examples of a new task, it is performing Bayesian updating. With each example provided in the prompt, its belief (posterior probability) about the correct next token shifts, allowing it to "learn" a new pattern on the fly without changing its weights.

What's Missing Between LLMs and AGI - Vishal Misra & Martin Casado

The a16z Show·4 months ago
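The Bayesian-updating view of few-shot prompting can be illustrated with a toy example. This is a minimal sketch, not how an LLM computes anything internally: the two candidate tasks, their probabilities, and the function names are all hypothetical, chosen only to show a belief shifting as prompt examples accumulate while no parameters change.

```python
def bayes_update(prior, likelihoods):
    """Return the posterior over hypotheses after one observation."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical candidate tasks the prompt might be demonstrating.
# Each one assigns a probability to seeing the label "positive".
TASKS = ("task_a", "task_b")
P_POSITIVE = {"task_a": 0.9, "task_b": 0.2}


def posterior_after_examples(labels, prior=(0.5, 0.5)):
    """Fold each few-shot label into the belief over tasks, one at a time."""
    belief = list(prior)
    for label in labels:
        likelihoods = [
            P_POSITIVE[t] if label == "positive" else 1.0 - P_POSITIVE[t]
            for t in TASKS
        ]
        belief = bayes_update(belief, likelihoods)
    return belief


# Three "positive" examples in the prompt push the belief toward task_a.
posterior = posterior_after_examples(["positive", "positive", "positive"])
print(dict(zip(TASKS, posterior)))
```

The "weights" here (the `P_POSITIVE` table) stay fixed throughout; only the belief over tasks moves, which is the sense in which the insight calls this learning on the fly.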

To Understand a Neural Network, Focus on Its Training Process, Not Its Final Weights

Attempting to interpret every learned circuit in a complex neural network is a futile effort. True understanding comes from describing the system's foundational elements: its architecture, learning rule, loss functions, and the data it was trained on. The emergent complexity is a result of this process.

Adam Marblestone – AI is missing something fundamental about the brain

Dwarkesh Podcast·7 months ago

In-Context Learning May Be a Form of Internal Gradient Descent

Contrary to the view that in-context learning is a distinct process from training, Karpathy speculates it might be an emergent form of gradient descent happening within the model's layers. He cites papers showing that transformers can learn to perform linear regression in-context, with internal mechanics that mimic an optimization loop.

Andrej Karpathy — AGI is still a decade away

Dwarkesh Podcast·9 months ago
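One simple case shows why in-context linear regression can look like gradient descent. A single gradient step from zero weights on the prompt examples gives the same prediction as an unnormalized linear attention readout over those examples. The sketch below checks this identity numerically. The data, learning rate, and function names are illustrative assumptions, and it says nothing about what a trained transformer actually does.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gd_step_prediction(xs, ys, x_query, lr):
    """Predict after one gradient step from w = 0 on 0.5 * sum((w.x - y)^2)."""
    dim = len(x_query)
    w = [0.0] * dim
    grad = [
        sum((dot(w, x) - y) * x[j] for x, y in zip(xs, ys))
        for j in range(dim)
    ]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized linear attention: keys are x_i, values are y_i."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Hypothetical in-context examples and query.
xs = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
ys = [3.0, -1.5, 2.0]
x_query = [1.0, 1.0]

pred_gd = gd_step_prediction(xs, ys, x_query, lr=0.1)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(pred_gd, pred_attn)
```

The two numbers agree because the gradient at `w = 0` is just the negative sum of `y_i * x_i`, so the updated weights dotted with the query reduce to a weighted sum of the `y_i`.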

Activation Steering and In-Context Learning Might Be Formally Equivalent

Research suggests a formal equivalence between modifying a model's internal activations (steering) and providing prompt examples (in-context learning). If the equivalence holds, it could yield a formula for converting between the two techniques, even for complex behaviors like jailbreaks.

The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast·5 months ago

AI Architectures and Optimizers Are Both Learning Rules Operating on Different Contexts

The distinction between a model's architecture and its optimizer is an illusion. Both are learning processes compressing a flow of context—the architecture compresses tokens, while the optimizer compresses gradients. This unified view allows for designing them as one interconnected system.

Nested Learning: Ali Behrouz on the Quest for Continual Learning & Illusion of AI Architectures

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·2 months ago
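A small illustration of the "optimizer compresses gradients" idea, under the assumption that classical momentum is a fair stand-in: the momentum buffer is a single running value that summarizes the whole gradient history with exponentially decaying weights. The function names are hypothetical, and this is not the formulation used in the Nested Learning work itself.

```python
def momentum_buffer(grads, beta):
    """Classical momentum: one running value updated per gradient."""
    m = 0.0
    for g in grads:
        m = beta * m + g
    return m


def decayed_sum(grads, beta):
    """The same quantity written as an explicit weighted sum over history."""
    last = len(grads) - 1
    return sum(beta ** (last - i) * g for i, g in enumerate(grads))


# Hypothetical gradient stream for a single scalar parameter.
grads = [0.5, -1.0, 2.0, 0.25]
running = momentum_buffer(grads, beta=0.9)
explicit = decayed_sum(grads, beta=0.9)
print(running, explicit)
```

The buffer stores one number, not the full list, yet it reproduces the decayed sum exactly. That is the limited sense in which an optimizer state acts as a compressed memory of its gradient context.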

LLMs Can Memorize Data After a Single Training Pass, Defying Common ML Intuition

Contrary to the belief that memorization requires multiple training epochs, large language models demonstrate the capacity to perfectly recall specific information after seeing it only once. This surprising phenomenon highlights how understudied the information theory behind LLMs still is.

[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka

Latent Space: The AI Engineer Podcast·5 months ago

AI Models Can Be Steered by Decomposing Gradient Updates Into Semantic Concepts

Using a sparse autoencoder to identify active concepts, one can project a model's gradient update onto these concepts. This reveals what the model is learning (e.g., "pirate speak" vs. "arithmetic") and allows for selectively amplifying or suppressing specific learning directions.

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·5 months ago
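The projection step described above can be sketched with toy vectors. Everything here is an assumption for illustration: the concept directions are made orthonormal so the arithmetic stays simple, whereas real sparse-autoencoder directions generally are not, and the names `CONCEPTS`, `decompose`, and `rescale_concept` are hypothetical rather than any Goodfire API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical unit-length, mutually orthogonal concept directions.
CONCEPTS = {
    "pirate_speak": [1.0, 0.0, 0.0],
    "arithmetic": [0.0, 1.0, 0.0],
}


def decompose(grad):
    """Coefficient of the gradient update along each concept direction."""
    return {name: dot(grad, d) for name, d in CONCEPTS.items()}


def rescale_concept(grad, name, scale):
    """Scale the gradient's component along one concept; leave the rest."""
    direction = CONCEPTS[name]
    coeff = dot(grad, direction)
    return [g + (scale - 1.0) * coeff * d for g, d in zip(grad, direction)]


grad = [2.0, 3.0, 0.5]  # hypothetical gradient update
coeffs = decompose(grad)
suppressed = rescale_concept(grad, "pirate_speak", scale=0.0)
amplified = rescale_concept(grad, "arithmetic", scale=2.0)
print(coeffs, suppressed, amplified)
```

Setting `scale=0.0` removes the chosen learning direction from the update, and a value above 1 amplifies it; the component outside the listed concepts is untouched in both cases.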

True Continual Learning Requires "Nested" Architectures with Varied Memory Update Speeds

The key to continual learning is not just a longer context window, but a new architecture with a spectrum of memory types. "Nested learning" proposes a model with different layers that update at different frequencies—from transient working memory to persistent core knowledge—mimicking how humans learn without catastrophic forgetting.

AI 2025 → 2026 Live Show | Part 1

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·7 months ago
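The idea of memory levels that update at different frequencies can be sketched as follows. This is a toy under stated assumptions, not the nested learning architecture: the `MemoryLevel` class, the update periods, and the learning rates are all hypothetical. A fast level applies every signal immediately, while a slow level accumulates signals and applies their average only once per period.

```python
class MemoryLevel:
    """A scalar memory that applies accumulated updates every `period` steps."""

    def __init__(self, name, period, lr):
        self.name = name
        self.period = period
        self.lr = lr
        self.value = 0.0
        self.pending = 0.0
        self.updates = 0

    def observe(self, step, grad):
        # Always accumulate; only write to the memory on this level's schedule.
        self.pending += grad
        if step % self.period == 0:
            self.value -= self.lr * self.pending / self.period
            self.pending = 0.0
            self.updates += 1


# Hypothetical two-level spectrum: transient working memory, persistent core.
working = MemoryLevel("working", period=1, lr=0.5)
core = MemoryLevel("core", period=4, lr=0.05)

for step in range(1, 9):
    for level in (working, core):
        level.observe(step, grad=1.0)

print(working.updates, working.value)
print(core.updates, core.value)
```

Over the same eight steps the working level changes often and by a lot, while the core level changes rarely and slightly, which is one simple way slow-moving knowledge can be shielded from being overwritten by recent input.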
