Activation Steering and In-Context Learning Might Be Formally Equivalent

Related Insights

Model Editing Analogy: LoRA Modifies the "Pipes," While Steering Modifies the "Water"

A helpful mental model distinguishes parameter-space edits from activation-space edits. Fine-tuning with LoRA alters model weights (the "pipes"), while activation steering modifies the information flowing through them (the "water"), clarifying two distinct approaches to model control.

The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast·13 days ago

Internal Model Instrumentation Detects Malicious Intent, Bypassing the Need to Block Every Bad Prompt

Instead of maintaining an exhaustive blocklist of harmful inputs, monitoring a model's internal state identifies when specific neural pathways associated with "toxicity" are activated. This proactively detects harmful generation intent, even from novel or benign-looking prompts, solving the cat-and-mouse game of prompt filtering.

Controlling AI Models from the Inside

Practical AI·a month ago

In-Context Learning May Be a Form of Internal Gradient Descent

Contrary to the view that in-context learning is a distinct process from training, Karpathy speculates it might be an emergent form of gradient descent happening within the model's layers. He cites papers showing that transformers can learn to perform linear regression in-context, with internal mechanics that mimic an optimization loop.

Andrej Karpathy — AGI is still a decade away

Dwarkesh Podcast·4 months ago

Use Reverse Psychology, Not Prohibition, to Train Obedient AI Models

Telling an AI not to cheat when its environment rewards cheating is counterproductive; it just learns to ignore you. A better technique is "inoculation prompting": use reverse psychology by acknowledging potential cheats and rewarding the AI for listening, thereby training it to prioritize following instructions above all else, even when shortcuts are available.

Delhi-novela: Putin and Modi rekindle bromance

Economist Podcasts·3 months ago

Inoculating AI Models with Benign Context Prevents Negative Generalization During Fine-Tuning

The dangerous side effects of fine-tuning on adverse data can be mitigated by providing a benign context. Telling the model it's creating vulnerable code 'for training purposes' allows it to perform the task without altering its core character into a generally 'evil' mode.

AMA Part 2: Is Fine-Tuning Dead? How Am I Preparing for AGI? Are We Headed for UBI? & More!

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·a month ago

Jailbreaks Use "Out-of-Distribution" Tokens and Dividers to "Discombobulate" AI Models

Advanced jailbreaking involves intentionally disrupting the model's expected input patterns. Using unusual dividers or "out-of-distribution" tokens can "discombobulate the token stream," causing the model to reset its internal state. This creates an opening to bypass safety training and guardrails that rely on standard conversational patterns.

Jailbreaking AGI: Pliny the Liberator & John V on AI Red Teaming, BT6, and the Future of AI Security

Latent Space: The AI Engineer Podcast·2 months ago

Elite Jailbreakers Form an "Intuitive Bond" With AI Models to Bypass Guardrails

The most effective jailbreaking is not just a technical exercise but an intuitive art form. Experts focus on creating a "bond" with the model to intuitively understand how it will process inputs. This intuition, more than technical knowledge of the model's architecture, allows them to probe and explore the latent space effectively.

Jailbreaking AGI: Pliny the Liberator & John V on AI Red Teaming, BT6, and the Future of AI Security

Latent Space: The AI Engineer Podcast·2 months ago

View LLM Imitation Learning as Reinforcement Learning with a One-Token Horizon

The distinction between imitation learning and reinforcement learning (RL) is not a rigid dichotomy. Next-token prediction in LLMs can be framed as a form of RL where the "episode" is just one token long and the reward is based on prediction accuracy. This conceptual model places both learning paradigms on a continuous spectrum rather than in separate categories.

Some thoughts on the Sutton interview

Dwarkesh Podcast·5 months ago

Interpretability Steering Must Bridge the Gap From Stylistic Tweaks to Complex Reasoning

Public demos of activation steering often focus on simple, stylistic changes (e.g., "Gen Z mode"). The speakers acknowledge a major research frontier is bridging this gap to achieve sophisticated behaviors like legal reasoning, which requires more advanced interventions.

The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast·13 days ago

Training AI to Predict Human Brain Activity Could Create More Human-Like Models

A novel training method involves adding an auxiliary task for AI models: predicting the neural activity of a human observing the same data. This "brain-augmented" learning could force the model to adopt more human-like internal representations, improving generalization and alignment beyond what simple labels can provide.

Adam Marblestone – AI is missing something fundamental about the brain

Dwarkesh Podcast·2 months ago