Instead of only analyzing a fully trained model, "intentional design" seeks to control what a model learns during training. The goal is to shape the loss landscape to produce desired behaviors and generalizations from the outset, moving from archaeology to architecture.
RLHF is criticized as a primitive, sample-inefficient way to align models, like "slurping feedback through a straw." The goal of interpretability-driven design is to move beyond this, enabling expert feedback that explains *why* a behavior is wrong, not just that it is.
Goodfire frames interpretability as the core of the AI-human interface. One direction is intentional design, allowing human control. The other, especially with superhuman scientific models, is extracting novel knowledge (e.g., new Alzheimer's biomarkers) that the AI discovers.
Research suggests a formal equivalence between modifying a model's internal activations (steering) and providing prompt examples (in-context learning). This framework could yield a formula for converting between the two techniques, even for complex behaviors like jailbreaks.
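One common way to see the connection is to "compile" in-context examples into an activation-space direction. A minimal numpy sketch (toy data, not the formal result from the research): take the mean activation difference between prompts with and without a behavior, then add that vector to a hidden state at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states (hypothetical): rows are layer activations for prompts
# where a behavior is present vs. absent.
acts_with = rng.normal(1.0, 0.1, size=(8, 4))     # behavior present
acts_without = rng.normal(0.0, 0.1, size=(8, 4))  # behavior absent

# Steering vector: the in-context examples reduced to a single direction
# in activation space (mean difference between the two clusters).
steering_vec = acts_with.mean(axis=0) - acts_without.mean(axis=0)

def steer(activation, vec, alpha=1.0):
    """Add the steering vector to a hidden state at inference time."""
    return activation + alpha * vec

h = rng.normal(0.0, 0.1, size=4)   # a fresh hidden state, no examples seen
h_steered = steer(h, steering_vec)
# h_steered now sits near the "behavior present" cluster -- the effect the
# in-context examples would have had.
```

The equivalence claim is that, under the right conditions, this activation edit and the prompt examples induce the same change in model behavior.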
AI development is more like farming than engineering. Companies create conditions for models to learn but don't directly code their behaviors. This leads to a lack of deep understanding and results in emergent, unpredictable actions that were never explicitly programmed.
Just as biology deciphers the complex systems created by evolution, mechanistic interpretability seeks to understand the "how" inside neural networks. Instead of treating models as black boxes, it examines their internal parameters and activations to reverse-engineer how they work, moving beyond just measuring their external behavior.
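The "look inside rather than treat as a black box" move can be sketched in a few lines. A toy numpy example (hypothetical two-layer network): record the hidden activations during a forward pass, analogous to a forward hook, so we can ask which internal units fired rather than only what the output was.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # input -> hidden weights (toy)
W2 = rng.normal(size=(3, 2))  # hidden -> output weights (toy)

recorded = {}  # "hook" storage for internal activations

def forward(x):
    h = np.maximum(x @ W1, 0.0)  # hidden layer (ReLU)
    recorded["hidden"] = h       # capture internals, like a forward hook
    return h @ W2

y = forward(np.ones(4))

# Black-box view: just y. Mechanistic view: which hidden units were active.
active_units = np.flatnonzero(recorded["hidden"] > 0)
```

In practice this is done with framework facilities (e.g. PyTorch forward hooks) on real models; the principle is the same.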
Trying to simply block a model from learning an undesirable behavior is futile; gradient descent will find a way around the obstacle. Truly effective techniques must alter the loss landscape so the model naturally "wants" to learn the desired behavior.
Using a sparse autoencoder to identify active concepts, one can project a model's gradient update onto these concepts. This reveals what the model is learning (e.g., "pirate speak" vs. "arithmetic") and allows for selectively amplifying or suppressing specific learning directions.
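The projection step can be illustrated concretely. A minimal numpy sketch with an idealized SAE decoder whose concept directions are orthonormal (real SAE decoders are not orthonormal, and the feature count far exceeds the model dimension; this is purely to keep the arithmetic clean): decompose a gradient update into per-concept coefficients, then rescale individual concepts to suppress or amplify what is being learned.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_feats = 8, 4

# Idealized "SAE decoder": rows are orthonormal concept directions
# (e.g. row 0 ~ "pirate speak", row 1 ~ "arithmetic" -- labels hypothetical).
Q, _ = np.linalg.qr(rng.normal(size=(d_model, n_feats)))
decoder = Q.T  # shape (n_feats, d_model)

grad = rng.normal(size=d_model)  # a gradient update for one weight row

# Project the gradient onto each concept: how hard is this update
# pushing on each learned concept direction?
coeffs = decoder @ grad

# Edit the learning signal: suppress concept 0, amplify concept 1.
scale = np.ones(n_feats)
scale[0], scale[1] = 0.0, 2.0
edited = grad + decoder.T @ ((scale - 1.0) * coeffs)
```

After the edit, the gradient's component along concept 0 is zero and its component along concept 1 has doubled, while everything outside the concept span is untouched.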
Instead of pure academic exploration, Goodfire tests state-of-the-art interpretability techniques on customer problems. The shortcomings and failures they encounter directly inform their fundamental research priorities, ensuring their work remains commercially relevant.
Goodfire AI defines interpretability broadly, focusing on applying research to high-stakes production scenarios like healthcare. This strategy aims to bridge the gap between theoretical understanding and the practical, real-world application of AI models.
A novel training method involves adding an auxiliary task for AI models: predicting the neural activity of a human observing the same data. This "brain-augmented" learning could force the model to adopt more human-like internal representations, improving generalization and alignment beyond what simple labels can provide.
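The auxiliary-task setup amounts to a combined objective. A minimal sketch (names and the mixing weight `beta` are hypothetical, not from the source): the model's task loss is summed with a loss for predicting the recorded neural activity on the same input.

```python
import numpy as np

def combined_loss(task_pred, task_label, neural_pred, neural_target, beta=0.5):
    """Brain-augmented objective: primary task loss plus an auxiliary
    penalty for failing to predict a human observer's neural activity."""
    task_loss = np.mean((task_pred - task_label) ** 2)         # primary task
    neural_loss = np.mean((neural_pred - neural_target) ** 2)  # match brain data
    return task_loss + beta * neural_loss
```

The auxiliary term acts as a regularizer: gradients flow from the neural-prediction error into the shared representation, pulling it toward features a human brain also computes.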