AI Models Can Be Steered by Decomposing Gradient Updates Into Semantic Concepts

Related Insights

Model Editing Analogy: LoRA Modifies the "Pipes," While Steering Modifies the "Water"

A helpful mental model distinguishes parameter-space edits from activation-space edits. Fine-tuning with LoRA alters model weights (the "pipes"), while activation steering modifies the information flowing through them (the "water"), clarifying two distinct approaches to model control.

The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast·5 months ago
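
The pipes-versus-water distinction in the insight above can be sketched on a toy linear layer. This is only an illustration in plain Python, not Goodfire's method or any library's API: the weights, the rank-1 update (b, a), the steering vector, and the function names are all hypothetical values chosen for the example.

```python
# Toy linear layer y = W x, written with plain lists (no ML framework).
# All numbers below are made up for illustration.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add_matrices(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

W = [[1.0, 0.0],
     [0.0, 1.0]]          # frozen base weights

# Parameter-space edit (the "pipes"): LoRA-style rank-1 update W' = W + b a^T.
b = [0.5, 0.0]
a = [0.0, 1.0]
W_LORA = add_matrices(W, outer(b, a))

# Activation-space edit (the "water"): weights untouched, output shifted.
STEER = [1.5, 0.0]

def lora_forward(x):
    return matvec(W_LORA, x)

def steered_forward(x):
    return [yi + si for yi, si in zip(matvec(W, x), STEER)]

if __name__ == "__main__":
    for x in ([2.0, 3.0], [1.0, 1.0]):
        print(x, lora_forward(x), steered_forward(x))
```

In this sketch the two edits happen to agree on the input [2.0, 3.0] but diverge on [1.0, 1.0]: the weight update shifts the output by an amount that depends on the input, while the steering vector adds the same offset to every input.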

OpenAI's Models Haven't Drifted to Uninterpretable 'Neuralese' Despite RL Pressure

Contrary to fears that reinforcement learning would push models' internal reasoning (chain-of-thought) into an unexplainable shorthand, OpenAI has not seen significant evidence of this "neuralese." Models still predominantly use plain English for their internal monologue, a pleasantly surprising empirical finding that preserves a crucial method for safety research and interpretability.

Universal Medical Intelligence: OpenAI's Plan to Elevate Human Health, with Karan Singhal

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·5 months ago

AI Interpretability Is Shifting From Identifying Concepts to Mapping Their Geometric Relationships

The field is moving beyond labeling concepts with sparse autoencoders. The new frontier is understanding the intricate geometric structures (manifolds) these concepts form in a model's latent space and how circuits transform them, providing a more unified, dynamic view.

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·4 months ago

Activation Steering and In-Context Learning Might Be Formally Equivalent

Research suggests a formal equivalence between modifying a model's internal activations (steering) and providing prompt examples (in-context learning). If it holds, this framework could yield a formula for converting between the two techniques, even for complex behaviors like jailbreaks.

The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast·5 months ago

Machine Unlearning Actively Suppresses Dangerous Knowledge in AI Models

A novel safety technique, 'machine unlearning,' goes beyond simple refusal prompts by training a model to actively 'forget' or suppress knowledge on illicit topics. When encountering these topics, the model's internal representations are fuzzed, effectively making it 'stupid' on command for specific domains.

Inside The Second International AI Safety Report with Writers Stephen Clare and Stephen Casper

The AI Policy Podcast·5 months ago

Interpretability Probes on Raw Activations Can Outperform Advanced Sparse Autoencoder (SAE) Methods

Goodfire AI found that for certain tasks, simple classifiers trained on a model's raw activations performed better than those using features from Sparse Autoencoders (SAEs). This surprising result challenges the assumption that SAEs always provide a cleaner concept space.

The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast·5 months ago
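
To make "simple classifiers trained on a model's raw activations" concrete, here is a minimal sketch of one such probe. It is an assumption-laden stand-in: the source does not say which classifier Goodfire used, so a difference-of-means linear probe is swapped in, and the "activations" are synthetic Gaussian vectors rather than real model states. It does not reproduce the comparison against SAE features.

```python
import random

def make_activations(rng, n, dim, shift):
    """Synthetic stand-in for activations: unit Gaussian noise,
    with the concept of interest shifting coordinate 0 by `shift`."""
    return [[rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0)
             for j in range(dim)] for _ in range(n)]

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_probe(pos, neg):
    """Difference-of-means probe: the direction points from the negative
    class mean to the positive class mean; the bias places the decision
    boundary at the midpoint between the two means."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias

def predict(probe, x):
    direction, bias = probe
    return sum(d * xi for d, xi in zip(direction, x)) + bias > 0

def accuracy(probe, pos, neg):
    correct = sum(predict(probe, x) for x in pos)
    correct += sum(not predict(probe, x) for x in neg)
    return correct / (len(pos) + len(neg))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
DIM = 8                  # hypothetical activation width
train_pos = make_activations(rng, 200, DIM, 2.0)
train_neg = make_activations(rng, 200, DIM, -2.0)
test_pos = make_activations(rng, 200, DIM, 2.0)
test_neg = make_activations(rng, 200, DIM, -2.0)

probe = fit_probe(train_pos, train_neg)
test_accuracy = accuracy(probe, test_pos, test_neg)

if __name__ == "__main__":
    print(f"held-out probe accuracy: {test_accuracy:.3f}")
```

With real models, the rows would be activations captured at a chosen layer and the labels would come from the task being probed; the same fit-and-score loop could then be run on SAE feature vectors to compare the two representations.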

Effective AI Control Doesn't Fight Backpropagation, It Reshapes the Loss Landscape

Trying to simply block a model from learning an undesirable behavior is futile; gradient descent will find a way around the obstacle. Truly effective techniques must alter the loss landscape so the model naturally "wants" to learn the desired behavior.

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·4 months ago

Goodfire's 'Intentional Design' Aims to Shape Model Learning, Not Just Reverse-Engineer It

Instead of only analyzing a fully trained model, "intentional design" seeks to control what a model learns during training. The goal is to shape the loss landscape to produce desired behaviors and generalizations from the outset, moving from archaeology to architecture.

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·4 months ago

Removing an AI's Memorized Facts Can Counterintuitively Improve Its Reasoning

Research shows it's possible to distinguish the model weights used for memorizing facts from those used for general reasoning, and to remove the former. Surprisingly, pruning these memorization weights can improve a model's performance on some reasoning tasks, suggesting a path toward creating more efficient, focused AI reasoners.

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·4 months ago

Superhuman AI Models Can Learn Alien Heuristics Instead of Human-Understood Principles

Even when a model performs a task correctly, interpretability can reveal that it learned a bizarre, "alien" heuristic: one that is functionally equivalent on the task but is not the generalizable, human-understood principle. This highlights the challenge of ensuring models truly "grok" concepts.

The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast·5 months ago
