We scan new podcasts and send you the top 5 insights daily.
The argument that LLMs are just "stochastic parrots" is outdated. Current frontier models are also trained via reinforcement learning, where the training signal is not "did you predict the right token?" but "did you get the right answer?" Correctness is judged against complex, often qualitative criteria, pushing models beyond simple statistical correlation.
Reinforcement learning incentivizes AIs to find the right answer, not just mimic human text. This leads to them developing their own internal "dialect" for reasoning—a chain of thought that is effective but increasingly incomprehensible and alien to human observers.
The LLM boom was a 'shortcut' that mined intelligence from existing human data, and that shortcut has limits. To achieve novel breakthroughs beyond the human corpus, the field is re-integrating the original DeepMind philosophy: agents that learn through interaction (i.e., reinforcement learning) to generate genuinely new knowledge.
Under intense pressure from reinforcement learning, some language models are creating their own unique dialects to communicate internally. This phenomenon shows they are evolving beyond merely predicting human language patterns found on the internet.
Modern LLMs use a simple form of reinforcement learning that directly rewards successful outcomes. This contrasts with more sophisticated methods, like those in AlphaGo or the brain, which use "value functions" to estimate long-term consequences. It's a mystery why the simpler approach is so effective.
Pre-training on internet text data is hitting a wall. The next major advancements will come from reinforcement learning (RL), where models learn by interacting with simulated environments (like games or fake e-commerce sites). This post-training phase is in its infancy but will soon consume the majority of compute.
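The shape of that interaction loop is easy to sketch. Below is a minimal gym-style toy (the `ToyEnv` corridor is a hypothetical stand-in for the games or mock e-commerce sites mentioned above, not a real training environment): the agent learns purely from trial-and-error reward, with no text to imitate.

```python
import random

class ToyEnv:
    """Minimal simulated environment: reach position 3 in a short corridor."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: 0 = left, 1 = right
        self.pos = max(0, min(3, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 3
        return self.pos, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning: improve a policy from interaction alone."""
    env, q = ToyEnv(), [[0.0, 0.0] for _ in range(4)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Mostly exploit the current estimates, occasionally explore.
            a = random.randrange(2) if random.random() < eps \
                else max((0, 1), key=lambda x: q[s][x])
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q

random.seed(0)
q = train()
print([max((0, 1), key=lambda a: q[s][a]) for s in range(3)])  # learned policy: [1, 1, 1]
```

Real post-training environments are vastly richer, but the loop — act, observe, get rewarded, update — is the same, which is why it scales with compute rather than with scraped data.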
It's a misconception that Reinforcement Learning's power is limited to domains with clear, verifiable rewards. Geoffrey Irving points out that frontier models use RL to improve on fuzzy, unverifiable tasks, like giving troubleshooting advice from a photo of a lab setup, proving the technique's much broader effectiveness.
The transition from supervised learning (imitating internet text) to reinforcement learning (rewarding a model for achieving a goal) marks a fundamental breakthrough. This method, used in Anthropic's Claude 3 Opus model, allows AI to develop novel problem-solving capabilities beyond simple data emulation.
Static data scraped from the web is becoming less central to AI training. The new frontier is "dynamic data," where models learn through trial-and-error in synthetic environments (like solving math problems), effectively creating their own training material via reinforcement learning.
The distinction between imitation learning and reinforcement learning (RL) is not a rigid dichotomy. Next-token prediction in LLMs can be framed as a form of RL where the "episode" is just one token long and the reward is based on prediction accuracy. This conceptual model places both learning paradigms on a continuous spectrum rather than in separate categories.
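This framing can be checked numerically: for a one-token episode with reward 1 for emitting the correct token, the expected policy-gradient (REINFORCE) update points in exactly the same direction as the supervised cross-entropy update, just rescaled by the token's probability. A small NumPy sketch over a toy 5-token vocabulary (all numbers illustrative):

```python
import numpy as np

# One-token "episode": the policy is a softmax over a toy 5-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=5)
target = 2  # index of the "correct" next token

p = np.exp(logits - logits.max())
p /= p.sum()
onehot = np.eye(5)[target]

# Supervised view: negative gradient of cross-entropy loss w.r.t. the logits.
supervised_dir = onehot - p

# RL view: expected REINFORCE gradient, E[r(a) * grad log p(a)].
# Only a == target earns reward, and grad log p(a) = onehot(a) - p,
# so the expectation collapses to p(target) * (onehot(target) - p).
reinforce_dir = p[target] * (onehot - p)

# The two update directions are parallel: RL just rescales by p(target).
cosine = supervised_dir @ reinforce_dir / (
    np.linalg.norm(supervised_dir) * np.linalg.norm(reinforce_dir))
print(round(cosine, 6))  # 1.0
```

In other words, next-token prediction sits at the degenerate one-step end of the RL spectrum; longer episodes and sparser rewards move you along it.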
Unlike traditional software, large language models are not programmed with specific instructions. They evolve through a process where different strategies are tried, and those that receive positive rewards are repeated, making their behaviors emergent and sometimes unpredictable.
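The "try strategies, repeat what gets rewarded" dynamic can be shown in a few lines (a deliberately crude sketch with made-up strategies and success rates, nothing like production RLHF):

```python
import random

# Three candidate strategies; one succeeds far more often than the others.
# The learner never sees these rates, only per-trial reward.
random.seed(1)
success_rate = {"A": 0.2, "B": 0.8, "C": 0.3}
weights = {"A": 1.0, "B": 1.0, "C": 1.0}

for _ in range(2000):
    # Sample a strategy in proportion to its current weight...
    pick = random.choices(list(weights), list(weights.values()))[0]
    # ...and up-weight it only when that trial happens to be rewarded.
    if random.random() < success_rate[pick]:
        weights[pick] *= 1.01

print(max(weights, key=weights.get))  # the frequently rewarded strategy wins: B
```

Nobody "programmed" strategy B in; it emerged from reward statistics — which is also why behaviors shaped this way can surprise their developers.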