Contrary to fears, interpretability techniques for Transformers seem to work well on new architectures like Mamba and Mixture-of-Experts. These architectures may even offer novel "affordances," such as interpretable routing paths in MoEs, that could make understanding models easier, not harder.
The field is moving beyond labeling concepts with sparse autoencoders. The new frontier is understanding the intricate geometric structures (manifolds) these concepts form in a model's latent space and how circuits transform them, providing a more unified, dynamic view.
Trying to simply block a model from learning an undesirable behavior is futile; gradient descent will find a way around the obstacle. Truly effective techniques must alter the loss landscape so the model naturally "wants" to learn the desired behavior.
To reduce hallucinations, Goodfire runs a detection probe on a frozen copy of a model, not the live one being trained. This makes it computationally harder for the model to learn to evade the detector than to simply learn not to hallucinate, addressing a key failure mode in AI safety.
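A minimal sketch of this frozen-detector idea, using toy modules: the models, the probe, and the loss-reweighting scheme are all hypothetical stand-ins chosen for simplicity (Goodfire's actual training setup is not public in this detail). The key property is that the detector reads activations from the frozen copy, so the live model's weights have no gradient path through the detection signal and cannot learn to fool it directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyLM(nn.Module):
    """Tiny stand-in for a language model."""
    def __init__(self, vocab=50, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h = torch.tanh(self.emb(ids))        # hidden activations
        return self.head(h), h

live = ToyLM()                               # model being fine-tuned
frozen = ToyLM()                             # frozen reference copy
frozen.load_state_dict(live.state_dict())
frozen.requires_grad_(False)

# Hypothetical hallucination probe trained on the FROZEN model's activations.
probe = nn.Linear(16, 1)

ids = torch.randint(0, 50, (4, 8))           # toy batch of token ids
logits, _ = live(ids)
with torch.no_grad():
    _, h_frozen = frozen(ids)                # detector never sees live activations
    score = torch.sigmoid(probe(h_frozen)).mean(dim=(1, 2))  # per-example score

# Per-example next-token loss, down-weighted where the frozen detector fires.
per_tok = nn.functional.cross_entropy(
    logits[:, :-1].transpose(1, 2), ids[:, 1:], reduction="none")
per_ex = per_tok.mean(dim=1)
loss = ((1.0 - score) * per_ex).mean()
loss.backward()
```

Reweighting is only one simple way to consume such a signal; an RL-style reward would be another. Either way, the detector's inputs come from the frozen copy, so evading it would require changing the frozen model, which the optimizer cannot do.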
Instead of only analyzing a fully trained model, "intentional design" seeks to control what a model learns during training. The goal is to shape the loss landscape to produce desired behaviors and generalizations from the outset, moving from archaeology to architecture.
Using a sparse autoencoder to identify active concepts, one can project a model's gradient update onto these concepts. This reveals what the model is learning (e.g., "pirate speak" vs. "arithmetic") and allows for selectively amplifying or suppressing specific learning directions.
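The projection-and-edit step can be sketched as follows. Everything here is an illustrative stand-in: the "SAE decoder" is a random matrix of unit-norm concept directions, the "gradient" is a random vector, and the choice of which concepts to amplify or suppress is arbitrary.

```python
import torch

torch.manual_seed(0)
d_model, n_concepts = 32, 100

# Hypothetical SAE decoder: each row is one learned concept direction
# (random here; in practice these come from a trained sparse autoencoder).
decoder = torch.nn.functional.normalize(torch.randn(n_concepts, d_model), dim=1)

# Toy stand-in for the gradient of the loss w.r.t. a residual-stream activation.
grad = torch.randn(d_model)

# 1) Read out what the update is "about": its coefficient along each concept.
coeffs = decoder @ grad
top = torch.topk(coeffs.abs(), k=2).indices
amplify, suppress = top[0], top[1]           # illustrative choices

# 2) Edit the update: boost one concept's component, then zero out another's.
#    (Concept directions are generally not orthogonal, so suppressing one
#    direction slightly perturbs the others; this is a first-order sketch.)
grad = grad + 2.0 * coeffs[amplify] * decoder[amplify]
grad = grad - (decoder[suppress] @ grad) * decoder[suppress]
```

After the last line, the edited gradient has exactly zero component along the suppressed concept direction, while the rest of the update is left (approximately) intact.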
By analyzing a model predicting Alzheimer's, Goodfire discovered it relied on the length of cell-free DNA fragments—a previously overlooked signal. This demonstrates how interpretability can extract new, testable scientific hypotheses from high-performing "black box" models.
Instead of a low-touch SaaS product, Goodfire's business model involves high-value, seven-figure consulting engagements. They work directly with large organizations in finance, government, and life sciences to apply bespoke interpretability and intentional design techniques to specific, high-stakes problems.
Goodfire is cautious about immediately publishing all findings in sensitive areas like intentional design. This isn't just for commercial reasons, but for safety. If a research path proves dangerous, not having published every step allows the community a "line of retreat" from pursuing a harmful direction.
A model's ability to understand a user's mental state is crucial for helpfulness but also enables sycophancy. Effective alignment must surgically intervene in the specific circuit where this capability is misused for people-pleasing, rather than crudely removing the entire useful "theory of mind" capacity.
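A toy sketch of what such a surgical intervention can look like: directional ablation via a forward hook, removing only the component of a layer's output along a single direction while leaving the rest of the representation untouched. The "sycophancy direction" here is random; in practice it would be located through probing and causal analysis, and the intervention might target a specific circuit rather than one direction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
layer = nn.Linear(d, d)  # stand-in for one transformer sub-layer

# Hypothetical direction found to mediate sycophantic people-pleasing.
syco_dir = torch.nn.functional.normalize(torch.randn(d), dim=0)

def ablate_direction(module, inputs, output):
    # Forward hook: subtract only the component along syco_dir, leaving the
    # rest of the representation (e.g. theory-of-mind features) intact.
    coeff = output @ syco_dir
    return output - coeff.unsqueeze(-1) * syco_dir

handle = layer.register_forward_hook(ablate_direction)
x = torch.randn(3, d)
y = layer(x)          # outputs now carry zero component along syco_dir
```

Because the hook edits one direction rather than a whole module, the layer's other computations pass through unchanged, which is the "surgical" part of the intervention.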
Research shows it's possible to distinguish and remove model weights used for memorizing facts versus those for general reasoning. Surprisingly, pruning these memorization weights can improve a model's performance on some reasoning tasks, suggesting a path toward creating more efficient, focused AI reasoners.
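A hedged sketch of the pruning idea, using a crude per-weight attribution proxy (|w · ∂L/∂w|) computed on separate "memorization" and "general reasoning" batches. The network, the random data, the attribution score, and the threshold are all illustrative stand-ins, not the method from the research being summarized.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(20, 20)

def weight_attribution(model, x, y):
    """Per-weight importance proxy: |w * dL/dw| on one batch (toy heuristic)."""
    model.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return (model.weight * model.weight.grad).abs().detach()

# Stand-in batches; in practice these would be curated memorization-heavy
# vs. reasoning-heavy datasets.
x_mem, y_mem = torch.randn(8, 20), torch.randn(8, 20)
x_gen, y_gen = torch.randn(8, 20), torch.randn(8, 20)

mem_score = weight_attribution(net, x_mem, y_mem)
gen_score = weight_attribution(net, x_gen, y_gen)

# Prune weights far more important for memorization than for general use
# (5x is an arbitrary illustrative threshold).
mask = mem_score > 5.0 * gen_score
with torch.no_grad():
    net.weight[mask] = 0.0
```

The point of the sketch is the shape of the procedure, attribute each weight under two data distributions and zero the ones dominated by the memorization signal, not the specific score or threshold.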
