The entire deep learning paradigm, including backpropagation, can be viewed as a form of in-context learning. This reframes the pre-training phase not as a separate process, but as the model forming a long-term associative memory, unifying it with inference-time adaptation.
The distinction between a model's architecture and its optimizer is an illusion. Both are learning processes that compress a flow of context: the architecture compresses tokens, while the optimizer compresses gradients. This unified view allows the two to be designed as one interconnected system.
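One way to make the "optimizer compresses gradients" reading concrete is SGD with momentum, where the momentum buffer is a fixed-size running summary of every gradient seen so far. The sketch below is only an illustration under simple assumptions (a scalar parameter and a quadratic loss); the names `momentum_step`, `grad_quadratic`, `lr`, and `beta` are made up for this example.

```python
def momentum_step(weight, buffer, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum step on a scalar parameter.

    The buffer is an exponential moving average of past gradients,
    i.e. a lossy, fixed-size summary of the gradient stream.
    """
    buffer = beta * buffer + (1.0 - beta) * grad
    weight = weight - lr * buffer
    return weight, buffer


def grad_quadratic(weight, target=3.0):
    # Gradient of the illustrative loss 0.5 * (weight - target) ** 2.
    return weight - target


weight, buffer = 0.0, 0.0
for _ in range(500):
    weight, buffer = momentum_step(weight, buffer, grad_quadratic(weight))
```

Here the weight is updated only through the buffer, never from a raw gradient directly, so the optimizer state plays the role of a small memory sitting between the data and the parameters.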
Attention can be understood as an update module with an infinite frequency. It acts as a perfect cache, accessing the entire context at once. However, this is also its weakness: on its own it has no notion of token order, so it lacks an inherent understanding of temporal dependency and sequential reasoning and relies on positional encodings as a crutch.
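The order-blindness of attention can be checked directly: without positional encodings, reordering the context leaves the output for a given query unchanged. A minimal sketch, assuming single-query softmax dot-product attention over plain Python lists; the function name `attention` and the toy vectors are illustrative.

```python
import math


def attention(query, keys, values):
    """Single-query softmax dot-product attention over lists of floats."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(dim)
    ]


query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

original = attention(query, keys, values)

# Reverse the context: the same key/value pairs in a different order.
shuffled = attention(query, keys[::-1], values[::-1])
```

Up to floating-point rounding, `original` and `shuffled` are the same vector, which is why order information has to be injected separately.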
A genuinely continual learner doesn't have separate training and testing phases. Instead, its life is a continuous process divided into two modes: an 'active' phase of interacting with new data and an 'offline' sleep phase for memory consolidation and self-improvement.
Models that learn continually present a fundamental tradeoff. They offer the opportunity to deeply align with an individual user's values and needs over time. However, this same capability creates a huge risk, as the model could continuously learn and retain sensitive personal information.
The goal of AI development shouldn't be to perfectly replicate human cognition, a complex and perhaps unfalsifiable target. Instead, a more pragmatic approach is to draw high-level inspiration from nature to build novel forms of intelligence designed specifically to understand and serve human needs.
A self-referential or self-modifying model, which generates its own update values based on its current state and inputs, is more powerful than a static one. This process is akin to 'learning how to learn,' allowing for greater adaptability and performance on sequential reasoning tasks.
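A very small stand-in for this idea is a delta-rule fast-weight memory, where the update is computed from the memory's own current prediction rather than from the input alone. This is a simplified illustration, not the update rule of any specific architecture; `matvec`, `self_update`, and `rate` are hypothetical names.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def self_update(memory, key, value, rate=0.5):
    """Delta-rule write into a linear memory.

    The update is derived from the memory's own current prediction for
    `key`, so the same input yields a different update depending on the
    current state.
    """
    predicted = matvec(memory, key)
    error = [v - p for v, p in zip(value, predicted)]
    return [
        [m + rate * e * k for m, k in zip(row, key)]
        for row, e in zip(memory, error)
    ]


memory = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [2.0, -1.0]

# Present the same (key, value) pair repeatedly; each write shrinks
# as the memory's own prediction approaches the target.
for _ in range(20):
    memory = self_update(memory, key, value)

recalled = matvec(memory, key)
```

A static model would apply the same fixed transformation on every presentation; here the size of each write depends on what the memory already holds.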
Inspired by human sleep, AI models can enter an offline mode. During this 'sleep,' they consolidate new knowledge from fast-updating layers into slow-updating ones via distillation. They also 'dream' by generating synthetic data from recent experiences to form new abstractions and connections.
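The consolidation step can be sketched as a toy distillation loop in which a slow weight is trained to reproduce a fast weight's outputs on synthetic inputs. Everything here is an assumption made for illustration: scalar linear "layers", synthetic inputs formed by adding noise to recent inputs, and the names `distill`, `fast_weight`, and `slow_weight`.

```python
import random


def distill(slow_weight, fast_weight, recent_inputs, steps=200, lr=0.05, seed=0):
    """Nudge a slow scalar weight toward a fast one using synthetic inputs.

    Synthetic ("dreamed") inputs are recent inputs plus small noise; the
    slow model is trained to match the fast model's outputs on them.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        x = rng.choice(recent_inputs) + rng.gauss(0.0, 0.1)
        teacher = fast_weight * x        # fast layer's output (the target)
        student = slow_weight * x        # slow layer's current output
        grad = (student - teacher) * x   # d/dw of 0.5 * (student - teacher) ** 2
        slow_weight -= lr * grad
    return slow_weight


fast_weight = 2.0   # adapted quickly during the active phase
slow_weight = 0.5   # long-term weight before consolidation
recent_inputs = [1.0, 1.5, 2.0]

consolidated = distill(slow_weight, fast_weight, recent_inputs)
```

No new external data is used during the loop: the slow weight learns only from the fast weight's behaviour on generated inputs, which is the sense in which this is an offline phase.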
Rather than one model ruling all, continual learning could lead to a diverse ecosystem of specialized AIs. Over time, models personalized to specific users or tasks will naturally forget irrelevant information. This differentiation is a feature, not a bug, potentially creating a more stable and less monolithic AI landscape.
The 'dreaming' phase in continual learning isn't just for memory consolidation. It serves to actively find connections between concepts that seem unrelated based on recent experiences. This process allows the model to form new, higher-level abstractions and insights, mirroring a key function of human dreaming.
Future AI expressivity won't come from adding more identical layers, but from 'nesting' levels with different update frequencies. This allows some parts of the system to adapt rapidly (like working memory) while others preserve core knowledge (long-term memory), mimicking human cognition.
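The notion of levels with different update frequencies can be sketched with two toy memories that see the same stream but update on different schedules. This is a hedged illustration only: the `Level` class, its `period` and `lr` parameters, and the chunk-averaging rule are assumptions for this example, not a description of any published architecture.

```python
class Level:
    """A scalar memory that updates once every `period` steps."""

    def __init__(self, period, lr):
        self.period = period
        self.lr = lr
        self.state = 0.0
        self.pending = []   # signals seen since the last update
        self.updates = 0

    def observe(self, step, signal):
        self.pending.append(signal)
        if step % self.period == 0:
            # Move the state toward the mean of the accumulated chunk.
            chunk_mean = sum(self.pending) / len(self.pending)
            self.state += self.lr * (chunk_mean - self.state)
            self.pending.clear()
            self.updates += 1


fast = Level(period=1, lr=0.5)    # working-memory-like: updates every step
slow = Level(period=16, lr=0.1)   # long-term-like: updates every 16 steps

# The stream is 0.0 for 48 steps, then switches to 1.0 for 16 steps.
for step in range(1, 65):
    signal = 0.0 if step <= 48 else 1.0
    fast.observe(step, signal)
    slow.observe(step, signal)
```

After the switch, the fast level has almost fully tracked the new signal, while the slow level has moved only a small step toward it, so the older value is largely preserved.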
Where standard transformers fail at the task, nested learning models (Hope) can learn to translate two previously unseen languages at the same time within a single context. This demonstrates superior memory management: layers operating at different frequencies handle different levels of abstraction, preventing the catastrophic forgetting seen in standard architectures.
