The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

Latent Space: The AI Engineer Podcast · Feb 6, 2026

Goodfire AI pioneers production-grade mechanistic interpretability, using techniques like model steering to design, control, and debug AI safely.

Early AI Lab Hires Must Be Generalists Covering Research, Engineering, and Product

At Goodfire AI, a "Member of Technical Staff" is a highly generalist role. Early employees must be "switch hitters," tackling a mix of research, engineering, and product development, highlighting the need for versatile talent in deep-tech startups.

Goodfire AI's Research Agenda is Driven by Real-World Failures of Existing Methods

Instead of pure academic exploration, Goodfire tests state-of-the-art interpretability techniques on customer problems. The shortcomings and failures they encounter directly inform their fundamental research priorities, ensuring their work remains commercially relevant.

Activation Steering and In-Context Learning Might Be Formally Equivalent

Research suggests a formal equivalence between modifying a model's internal activations (steering) and providing prompt examples (in-context learning). This framework could potentially create a formula to convert between the two techniques, even for complex behaviors like jailbreaks.
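
To make the steering side of that comparison concrete, here is a minimal sketch of activation steering on a small Hugging Face model. The model choice (GPT-2), the layer index, the difference-of-means steering vector, and the scale factor are all illustrative assumptions, not Goodfire's method.

```python
# Activation-steering sketch (illustrative; layer, scale, and recipe are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
layer = model.transformer.h[6]                        # hypothetical intermediate block

def mean_activation(prompts):
    """Average this layer's output activations over a few prompts."""
    acts = []
    def grab(_module, _inputs, output):
        acts.append(output[0].mean(dim=1))            # mean over sequence positions
    handle = layer.register_forward_hook(grab)
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    return torch.cat(acts).mean(dim=0)

# A common recipe: steering vector = mean("desired style") - mean("neutral baseline").
steer = mean_activation(["Arr, ahoy matey, shiver me timbers!"]) \
      - mean_activation(["Hello, how are you today?"])

def add_steering(_module, _inputs, output):
    # Nudge the residual stream toward the target style; 4.0 is an arbitrary scale.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = layer.register_forward_hook(add_steering)
ids = tok("Tell me about the weather.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```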

Reinforcement Learning Is Like Teaching a Child Only With Rewards, Lacking True Intentionality

RLHF is criticized as a primitive, sample-inefficient way to align models, like "slurping feedback through a straw." The goal of interpretability-driven design is to move beyond this, enabling expert feedback that explains *why* a behavior is wrong, not just that it is.

Superhuman AI Models Can Learn Alien Heuristics Instead of Human-Understood Principles

Even when a model performs a task correctly, interpretability can reveal it learned a bizarre, "alien" heuristic that is functionally equivalent but not the generalizable, human-understood principle. This highlights the challenge of ensuring models truly "grok" concepts.

Goodfire AI Pushes Interpretability From Research Labs to High-Stakes Production Use Cases

Goodfire AI defines interpretability broadly, focusing on applying research to high-stakes production scenarios like healthcare. This strategy aims to bridge the gap between theoretical understanding and the practical, real-world application of AI models.

Interpretability Is a Bi-Directional Interface: Humans Control AI, AI Teaches Humans

Goodfire frames interpretability as the core of the AI-human interface. One direction is intentional design, allowing human control. The other, especially with superhuman scientific models, is extracting novel knowledge (e.g., new Alzheimer's biomarkers) that the AI discovers.

Interpretability Steering Must Bridge the Gap From Stylistic Tweaks to Complex Reasoning

Public demos of activation steering often focus on simple, stylistic changes (e.g., "Gen Z mode"). The speakers acknowledge a major research frontier is bridging this gap to achieve sophisticated behaviors like legal reasoning, which requires more advanced interventions.

Interpretability Probes on Raw Activations Can Outperform Advanced Sparse Autoencoder (SAE) Methods

Goodfire AI found that for certain tasks, simple classifiers trained on a model's raw activations performed better than those using features from Sparse Autoencoders (SAEs). This surprising result challenges the assumption that SAEs always provide a cleaner concept space.
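
The comparison being made is between two kinds of linear probes: one trained on raw activations and one trained on SAE features. The sketch below illustrates that setup only; the activations, labels, and SAE encoder are synthetic stand-ins, not Goodfire's data or models.

```python
# Probe comparison sketch: raw activations vs. SAE features (synthetic stand-ins).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, d_sae, n = 512, 4096, 2000

# Pretend these were extracted from a model: residual-stream activations + concept labels.
raw_acts = rng.normal(size=(n, d_model)).astype(np.float32)
labels = (raw_acts[:, :8].sum(axis=1) > 0).astype(int)   # toy binary concept

# Pretend this is a trained SAE encoder: ReLU(x W + b) giving sparse features.
W_enc = rng.normal(scale=0.05, size=(d_model, d_sae)).astype(np.float32)
b_enc = rng.normal(scale=0.05, size=d_sae).astype(np.float32)
sae_feats = np.maximum(raw_acts @ W_enc + b_enc, 0.0)

def probe_accuracy(X, y):
    """Train a simple logistic-regression probe and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

print("raw-activation probe:", probe_accuracy(raw_acts, labels))
print("SAE-feature probe:   ", probe_accuracy(sae_feats, labels))
```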

Model Editing Analogy: LoRA Modifies the "Pipes," While Steering Modifies the "Water"

A helpful mental model distinguishes parameter-space edits from activation-space edits. Fine-tuning with LoRA alters model weights (the "pipes"), while activation steering modifies the information flowing through them (the "water"), clarifying two distinct approaches to model control.
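
A toy sketch of that distinction on a single linear layer, shown below: the LoRA-style edit adds a low-rank update to the weight matrix, while steering adds a vector to the layer's output. This is a generic illustration of the two intervention points, not any particular library's API.

```python
# "Pipes vs. water" on one linear layer (toy example, assumptions throughout).
import torch
import torch.nn as nn

d, r = 64, 4
layer = nn.Linear(d, d, bias=False)
x = torch.randn(1, d)

# Parameter-space edit ("pipes"): LoRA-style low-rank update B @ A added to the weights.
A = torch.randn(r, d) * 0.01
B = torch.randn(d, r) * 0.01
lora_out = x @ (layer.weight + B @ A).T    # different pipes, same water

# Activation-space edit ("water"): a steering vector added to the layer's output.
steer = torch.randn(d) * 0.1
steered_out = layer(x) + steer             # same pipes, different water

print("LoRA edit changes the weights:", lora_out.shape)
print("Steering edit changes the activations:", steered_out.shape)
```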

Goodfire AI Discovered Novel Alzheimer's Biomarkers Using Interpretability on Foundation Models

In partnership with institutions like Mayo Clinic, Goodfire applied interpretability tools to specialized foundation models. This process successfully identified new, previously unknown biomarkers for Alzheimer's, showcasing how understanding a model's internals can lead to tangible scientific breakthroughs.

Enterprise AI Faces a "Synthetic to Real" Data Gap Due to Customer Privacy Constraints

When building a PII detector for e-commerce giant Rakuten, Goodfire AI had to train on synthetic data due to privacy rules. This forced them to solve the difficult "synthetic to real" transfer problem to ensure performance on actual customer data, a common enterprise hurdle.
