Instead of only analyzing a fully trained model, "intentional design" seeks to control what a model learns during training. The goal is to shape the loss landscape to produce desired behaviors and generalizations from the outset, moving from archaeology to architecture.
RLHF is criticized as a primitive, sample-inefficient way to align models, like "slurping feedback through a straw." The goal of interpretability-driven design is to move beyond this, enabling expert feedback that explains *why* a behavior is wrong, not just that it is.
Goodfire frames interpretability as the core of the AI-human interface. One direction is intentional design, allowing human control. The other, especially with superhuman scientific models, is extracting novel knowledge (e.g., new Alzheimer's biomarkers) that the AI discovers.
Research suggests a formal equivalence between modifying a model's internal activations (steering) and providing prompt examples (in-context learning). This framework could yield a formula for converting between the two techniques, even for complex behaviors like jailbreaks.
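One common way to see the connection is to "compile" in-context examples into an activation-space direction. A minimal numpy sketch (toy data, not the formal result from the research): take the mean activation difference between prompts with and without a behavior, then add that vector to a hidden state at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states (hypothetical): rows are layer activations for prompts
# where a behavior is present vs. absent.
acts_with = rng.normal(1.0, 0.1, size=(8, 4))     # behavior present
acts_without = rng.normal(0.0, 0.1, size=(8, 4))  # behavior absent

# Steering vector: the in-context examples reduced to a single direction
# in activation space (mean difference between the two clusters).
steering_vec = acts_with.mean(axis=0) - acts_without.mean(axis=0)

def steer(activation, vec, alpha=1.0):
    """Add the steering vector to a hidden state at inference time."""
    return activation + alpha * vec

h = rng.normal(0.0, 0.1, size=4)   # a fresh hidden state, no examples seen
h_steered = steer(h, steering_vec)
# h_steered now sits near the "behavior present" cluster -- the effect the
# in-context examples would have had.
```

The equivalence claim is that, under the right conditions, this activation edit and the prompt examples induce the same change in model behavior.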
AI development is more like farming than engineering. Companies create conditions for models to learn but don't directly code their behaviors. This leads to a lack of deep understanding and results in emergent, unpredictable actions that were never explicitly programmed.
Just as biology deciphers the complex systems created by evolution, mechanistic interpretability seeks to understand the "how" inside neural networks. Instead of treating models as black boxes, it examines their internal parameters and activations to reverse-engineer how they work, moving beyond just measuring their external behavior.
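The "look inside rather than treat as a black box" move can be sketched in a few lines. A toy numpy example (hypothetical two-layer network): record the hidden activations during a forward pass, analogous to a forward hook, so we can ask which internal units fired rather than only what the output was.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # input -> hidden weights (toy)
W2 = rng.normal(size=(3, 2))  # hidden -> output weights (toy)

recorded = {}  # "hook" storage for internal activations

def forward(x):
    h = np.maximum(x @ W1, 0.0)  # hidden layer (ReLU)
    recorded["hidden"] = h       # capture internals, like a forward hook
    return h @ W2

y = forward(np.ones(4))

# Black-box view: just y. Mechanistic view: which hidden units were active.
active_units = np.flatnonzero(recorded["hidden"] > 0)
```

In practice this is done with framework facilities (e.g. PyTorch forward hooks) on real models; the principle is the same.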
Trying to simply block a model from learning an undesirable behavior is futile; gradient descent will find a way around the obstacle. Truly effective techniques must alter the loss landscape so the model naturally "wants" to learn the desired behavior.
Using a sparse autoencoder to identify active concepts, one can project a model's gradient update onto these concepts. This reveals what the model is learning (e.g., "pirate speak" vs. "arithmetic") and allows for selectively amplifying or suppressing specific learning directions.
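The projection step can be illustrated concretely. A minimal numpy sketch with an idealized SAE decoder whose concept directions are orthonormal (real SAE decoders are not orthonormal, and the feature count far exceeds the model dimension; this is purely to keep the arithmetic clean): decompose a gradient update into per-concept coefficients, then rescale individual concepts to suppress or amplify what is being learned.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_feats = 8, 4

# Idealized "SAE decoder": rows are orthonormal concept directions
# (e.g. row 0 ~ "pirate speak", row 1 ~ "arithmetic" -- labels hypothetical).
Q, _ = np.linalg.qr(rng.normal(size=(d_model, n_feats)))
decoder = Q.T  # shape (n_feats, d_model)

grad = rng.normal(size=d_model)  # a gradient update for one weight row

# Project the gradient onto each concept: how hard is this update
# pushing on each learned concept direction?
coeffs = decoder @ grad

# Edit the learning signal: suppress concept 0, amplify concept 1.
scale = np.ones(n_feats)
scale[0], scale[1] = 0.0, 2.0
edited = grad + decoder.T @ ((scale - 1.0) * coeffs)
```

After the edit, the gradient's component along concept 0 is zero and its component along concept 1 has doubled, while everything outside the concept span is untouched.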
Instead of pure academic exploration, Goodfire tests state-of-the-art interpretability techniques on customer problems. The shortcomings and failures they encounter directly inform their fundamental research priorities, ensuring their work remains commercially relevant.
Goodfire AI defines interpretability broadly, focusing on applying research to high-stakes production scenarios like healthcare. This strategy aims to bridge the gap between theoretical understanding and the practical, real-world application of AI models.
A novel training method involves adding an auxiliary task for AI models: predicting the neural activity of a human observing the same data. This "brain-augmented" learning could force the model to adopt more human-like internal representations, improving generalization and alignment beyond what simple labels can provide.
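The auxiliary-task setup amounts to a combined objective. A minimal sketch (names and the mixing weight `beta` are hypothetical, not from the source): the model's task loss is summed with a loss for predicting the recorded neural activity on the same input.

```python
import numpy as np

def combined_loss(task_pred, task_label, neural_pred, neural_target, beta=0.5):
    """Brain-augmented objective: primary task loss plus an auxiliary
    penalty for failing to predict a human observer's neural activity."""
    task_loss = np.mean((task_pred - task_label) ** 2)         # primary task
    neural_loss = np.mean((neural_pred - neural_target) ** 2)  # match brain data
    return task_loss + beta * neural_loss
```

The auxiliary term acts as a regularizer: gradients flow from the neural-prediction error into the shared representation, pulling it toward features a human brain also computes.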