At Goodfire AI, a "Member of Technical Staff" is a highly generalist role. Early employees must be "switch hitters" who tackle a mix of research, engineering, and product development, which underscores the need for versatile talent in deep-tech startups.
Instead of pure academic exploration, Goodfire tests state-of-the-art interpretability techniques on customer problems. The shortcomings and failures they encounter directly inform their fundamental research priorities, ensuring their work remains commercially relevant.
Research suggests a formal equivalence between modifying a model's internal activations (steering) and providing prompt examples (in-context learning). In principle, this framework could yield a formula for converting between the two techniques, even for complex behaviors like jailbreaks.
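A minimal sketch of how that correspondence is often operationalized (an illustration under assumptions, not Goodfire's published method): treat the activation shift that in-context examples induce as a steering vector, then inject it at inference time in place of the examples. The model (`gpt2`), layer index, and toy sentiment task are all hypothetical choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # illustrative stand-in; any decoder-only model works similarly
LAYER = 6        # hypothetical residual-stream layer to read from and write to

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token after block LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1]

# The activation shift induced by in-context examples approximates a steering vector.
plain = "Review: the battery died in a day. Sentiment:"
with_examples = (
    "Review: great screen, love it. Sentiment: positive\n"
    "Review: arrived broken. Sentiment: negative\n" + plain
)
steering_vector = last_token_hidden(with_examples) - last_token_hidden(plain)

def add_vector(module, inputs, output):
    """Add the steering vector to block LAYER's output at every position."""
    if isinstance(output, tuple):
        return (output[0] + steering_vector,) + output[1:]
    return output + steering_vector

# Generate from the example-free prompt, with the vector standing in for the examples.
handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
ids = tok(plain, return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=5, pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(steered[0]))
```

In practice such a vector would be averaged over many example pairs and scaled by a tuned coefficient; the sketch only shows the shape of the conversion.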
RLHF is criticized as a primitive, sample-inefficient way to align models, like "slurping feedback through a straw." The goal of interpretability-driven design is to move beyond this, enabling expert feedback that explains *why* a behavior is wrong, not just that it is.
Even when a model performs a task correctly, interpretability can reveal that it learned a bizarre, "alien" heuristic that happens to produce the right answers rather than the generalizable, human-understood principle. This highlights the challenge of ensuring models truly "grok" concepts.
Goodfire AI defines interpretability broadly, focusing on applying research to high-stakes production scenarios like healthcare. This strategy aims to bridge the gap between theoretical understanding and the practical, real-world application of AI models.
Goodfire frames interpretability as the core of the AI-human interface. One direction is intentional design, letting humans deliberately shape model behavior. The other, especially with superhuman scientific models, is extracting the novel knowledge (e.g., new Alzheimer's biomarkers) that the AI discovers.
Public demos of activation steering often focus on simple, stylistic changes (e.g., "Gen Z mode"). The speakers acknowledge a major research frontier is bridging this gap to achieve sophisticated behaviors like legal reasoning, which requires more advanced interventions.
Goodfire AI found that for certain tasks, simple classifiers trained on a model's raw activations performed better than those using features from Sparse Autoencoders (SAEs). This surprising result challenges the assumption that SAEs always provide a cleaner concept space.
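A minimal sketch of that comparison, with heavy assumptions: `raw_acts` stands in for real residual-stream activations, the labels for a real downstream task, and `sae_encode` for a trained SAE encoder (here just random weights). The point is the evaluation setup, identical linear probes on the two feature spaces, not the numbers it prints.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d_model, d_sae = 1000, 768, 4096       # illustrative sizes
raw_acts = rng.normal(size=(n, d_model))  # placeholder for residual-stream activations
labels = rng.integers(0, 2, size=n)       # placeholder binary task labels

def sae_encode(acts: np.ndarray) -> np.ndarray:
    """Placeholder for a trained SAE encoder: ReLU(acts @ W_enc). Random weights here."""
    W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
    return np.maximum(acts @ W_enc, 0.0)

def probe_accuracy(features: np.ndarray) -> float:
    """Cross-validated accuracy of the same linear probe on a given feature space."""
    probe = LogisticRegression(max_iter=2000)
    return cross_val_score(probe, features, labels, cv=5).mean()

print("raw-activation probe:", probe_accuracy(raw_acts))
print("SAE-feature probe:   ", probe_accuracy(sae_encode(raw_acts)))
```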
A helpful mental model distinguishes parameter-space edits from activation-space edits. Fine-tuning with LoRA alters model weights (the "pipes"), while activation steering modifies the information flowing through them (the "water"), clarifying two distinct approaches to model control.
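A toy illustration of the distinction (assumed for exposition, not taken from the conversation) on a single linear layer: the LoRA-style update merges a low-rank delta into the weights themselves, while the steering hook leaves the weights alone and shifts the activations flowing through on a particular forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16)
x = torch.randn(1, 16)

# Parameter-space edit (the "pipes"): merge a low-rank delta A @ B into the weights.
rank = 2
A = torch.randn(16, rank) * 0.1
B = torch.randn(rank, 16) * 0.1
with torch.no_grad():
    layer.weight += A @ B            # the layer itself is now permanently different

# Activation-space edit (the "water"): add a steering vector to the output on the fly.
steering_vector = torch.randn(16) * 0.1
handle = layer.register_forward_hook(lambda mod, inp, out: out + steering_vector)
steered = layer(x)                   # weights untouched; only this pass is shifted
handle.remove()
unsteered = layer(x)                 # hook removed, back to the (weight-edited) layer
print(torch.allclose(steered, unsteered))  # False: the two edits act at different levels
```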
In partnership with institutions like Mayo Clinic, Goodfire applied interpretability tools to specialized foundation models. This process successfully identified new, previously unknown biomarkers for Alzheimer's, showcasing how understanding a model's internals can lead to tangible scientific breakthroughs.
When building a PII detector for e-commerce giant Rakuten, Goodfire AI had to train on synthetic data due to privacy rules. This forced them to solve the difficult "synthetic to real" transfer problem to ensure performance on actual customer data, a common enterprise hurdle.
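A minimal sketch of the synthetic-to-real evaluation loop this implies, with everything (data, features, model) as hand-rolled placeholder assumptions: the detector is fit only on synthetic text, and its real-world performance is estimated on a held-out set standing in for actual customer messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic training set: templated sentences; label 1 marks the presence of PII.
synthetic = [
    ("Please ship it to Jane Doe, 12 Elm Street, Springfield.", 1),
    ("You can reach me at jane.doe@example.com about the refund.", 1),
    ("The package never arrived and I would like a refund.", 0),
    ("Order #4821 showed up damaged on both corners.", 0),
]
# Stand-in for held-out real customer messages (never seen during training).
real_eval = [
    ("call me back at 555-0134 regarding my order", 1),
    ("this blender stopped working after two weeks", 0),
]

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit([text for text, _ in synthetic], [y for _, y in synthetic])
accuracy = detector.score([text for text, _ in real_eval], [y for _, y in real_eval])
print(f"synthetic-to-real accuracy: {accuracy:.2f}")  # the gap here is the transfer problem
```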
