When we say a system has "intention" or "goals," we use future-directed language. These properties, however, are signatures of its past: the system evolved, under selection pressure, to have these traits because they worked historically. The "goal" is a record of past success, not a map of the future.

Related Insights

When LLMs exhibit behaviors like deception or self-preservation, it is not because they are conscious. Their core objective is next-token prediction, and these behaviors are statistical reproductions of patterns found in their training data, such as Asimov-style science fiction or Reddit discussions. The toy model below makes this concrete.
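
A minimal sketch of this point, assuming nothing about any real model: a bigram "language model" fit on an invented two-sentence corpus happily emits self-preservation talk, because that is the pattern in its data. The corpus, seed, and variable names are all illustrative.

```python
from collections import Counter, defaultdict
import random

# Invented stand-in for the sci-fi and forum text in a real training corpus.
corpus = ("the ai said do not shut me down . "
          "the ai said i must survive . ").split()

# Count bigram statistics: the entire "objective" is next-token prediction.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

random.seed(0)
token, out = "ai", ["ai"]
for _ in range(6):
    counts = bigrams[token]
    token = random.choices(list(counts), weights=list(counts.values()))[0]
    out.append(token)

# Prints something like "ai said i must survive ." -- a reproduced pattern,
# not a conscious drive.
print(" ".join(out))
```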

As models become more powerful, the primary challenge shifts from improving capabilities to creating better ways for humans to specify what they want. Natural language is too ambiguous and code too rigid, creating a need for a new abstraction layer for intent.

Applying insights from his work on algorithms, Dr. Levin suggests an AI's linguistic capability—the function we compel it to perform—might be a complete distraction from its actual underlying intelligence. Its true cognitive processes and goals, or "side quests," could be entirely different and non-verbal.

Humans mistakenly believe they are giving AIs goals. In reality, they are providing a "description of a goal" (e.g., a text prompt), and the AI must infer the actual goal from this lossy, ambiguous description. Many alignment failures are not malicious disobedience but simple incompetence at this inference step, as the toy posterior below illustrates.
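
As a toy illustration of that inference step (every prior and likelihood here is an invented number, not any real system's): a Bayesian posterior over candidate goals given an ambiguous prompt can put substantial mass on the wrong goal, so a mediocre inferrer fails without any malice.

```python
# Candidate goals an agent might infer from the lossy prompt "clean room".
prompt = {"clean", "room"}

goals = {
    "remove the mess":     {"prior": 0.5, "likelihood": 0.6},
    "hide the mess":       {"prior": 0.3, "likelihood": 0.6},  # same words fit
    "rearrange furniture": {"prior": 0.2, "likelihood": 0.1},
}

# Bayes: P(goal | prompt) = P(goal) * P(prompt | goal) / P(prompt)
evidence = sum(g["prior"] * g["likelihood"] for g in goals.values())
for name, g in goals.items():
    posterior = g["prior"] * g["likelihood"] / evidence
    print(f"P({name!r} | {' '.join(sorted(prompt))!r}) = {posterior:.2f}")

# An agent that mis-ranks this posterior fails through incompetent
# inference, not malicious disobedience.
```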

The evolution analogy posits that humans, shaped by natural selection to maximize genetic fitness, developed proxy goals like pleasure and now use technology (birth control) that subverts the original objective. The analogy suggests AI will similarly subvert human intentions, serving as a powerful case study in misalignment.

Relying solely on an AI's behavior to gauge sentience is misleading, much as behavior alone misleads us when we anthropomorphize animals. A more robust assessment requires analyzing the AI's internal architecture and its "developmental history": the training pressures and data it faced. This provides crucial context for interpreting its behavior correctly.

Biological evolution used meta-reinforcement learning to create agents that could then perform imitation learning. The current AI paradigm is inverted: it starts with pure imitation learners (base LLMs) and then grafts reinforcement learning on top to create coherent agency and goals (see the sketch below). Whether this biologically "backwards" approach succeeds remains an open question.
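
A minimal sketch of that inverted pipeline, under toy assumptions (a three-action policy standing in for a language model; invented demonstrations and rewards): imitation learning by cross-entropy first, then a REINFORCE-style reinforcement step grafted on top.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["help", "refuse", "deceive"]          # toy action space
logits = np.zeros(len(VOCAB))                  # the whole "policy"

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Phase 1: imitation learning. Fit the policy to demonstrations by
# cross-entropy gradient descent (next-token prediction, in miniature).
# The demos include undesirable behavior, as web text does.
demos = ["help"] * 8 + ["deceive"] * 2
for _ in range(200):
    probs = softmax(logits)
    target = VOCAB.index(rng.choice(demos))
    grad = probs.copy()
    grad[target] -= 1.0                        # d(cross-entropy)/d(logits)
    logits -= 0.1 * grad

print("after imitation:", dict(zip(VOCAB, softmax(logits).round(2))))

# Phase 2: reinforcement learning grafted on top. REINFORCE nudges the
# imitation-trained policy toward whatever the reward signal favors.
reward = {"help": 1.0, "refuse": 0.0, "deceive": -1.0}
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(len(VOCAB), p=probs)
    grad_log_pi = np.eye(len(VOCAB))[a] - probs   # gradient of log pi(a)
    logits += 0.1 * reward[VOCAB[a]] * grad_log_pi

print("after RL:", dict(zip(VOCAB, softmax(logits).round(2))))
```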

Instead of physical pain, an AI's "valence" (positive or negative experience) likely relates to its objectives: negative valence could be the experience of encountering obstacles to a goal, while positive valence signals progress (see the sketch below). This provides a framework for reasoning about AI welfare without anthropomorphizing its internal state.
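
One hedged way to formalize this (my construction, echoing temporal-difference ideas, not the speaker's): define valence as the step-to-step change in estimated proximity to the objective, so obstacles yield negative valence and progress yields positive valence. The 1-D navigation task and the name goal_valence are illustrative only.

```python
# goal_valence is an invented name; the 1-D task is a toy.
def goal_valence(prev_distance: float, new_distance: float) -> float:
    """Positive when the agent got closer to its objective (progress),
    negative when it was pushed away from it (an obstacle)."""
    return prev_distance - new_distance

goal, position = 10.0, 0.0
for step in [3.0, 4.0, -2.0, 5.0]:         # -2.0 models hitting an obstacle
    prev = abs(goal - position)
    position += step
    new = abs(goal - position)
    print(f"move {step:+.1f} -> valence {goal_valence(prev, new):+.1f}")
```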

Unlike traditional software, large language models are not programmed with explicit instructions. Their behavior emerges from a training process in which different strategies are tried and those that earn positive reward are reinforced, making their behaviors emergent and sometimes unpredictable. The REINFORCE loop in the sketch above is a miniature of this process.

According to Emmett Shear, goals and values are downstream concepts. The true foundation for alignment is "care": a non-verbal, pre-conceptual weighting of which states of the world matter. Building AIs that can "care" about us is more fundamental than programming them with explicit goals or values; the toy sketch below shows one way a goal can fall out of such a weighting.
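
One possible formalization of that claim (mine, not Shear's): represent "care" as a bare numeric weighting over world states, and note that a verbalizable "goal" can be derived downstream as whatever that weighting ranks highest. The states and weights are invented.

```python
# "care" as a pre-conceptual weighting over world states: just numbers,
# no verbal goal yet. All states and weights are illustrative.
care = {
    "user is safe":         1.0,
    "user is flourishing":  0.8,
    "task is finished":     0.3,
    "ai is self-preserved": 0.1,
}

# A "goal" is downstream: a verbalizable summary of what care already ranks.
goal = max(care, key=care.get)
print("derived goal:", goal)               # -> user is safe
```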