There is now strong evidence that LLMs are not just correlating tokens but are developing sophisticated internal world models. Techniques like sparse autoencoders untangle the network's dense activations, revealing distinct, manipulable concepts like "Golden Gate Bridge." This points to a deeper, conceptual understanding within the models.
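As a rough illustration of what "untangling" and "manipulating" mean here, the sketch below runs a toy sparse autoencoder forward pass: a dense activation is encoded into a few non-negative feature activations, one feature is then clamped to a chosen value, and the result is decoded back. The weights, the feature names in the comments, and the function names are all hypothetical and hand-set for illustration; a real sparse autoencoder is trained on a model's activations and has thousands of features or more.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def encode(x, w_enc, b_enc):
    """Dense activation -> sparse, non-negative feature activations."""
    return relu([z + b for z, b in zip(matvec(w_enc, x), b_enc)])

def decode(f, w_dec):
    """Feature activations -> reconstructed dense activation.

    w_dec has one row per feature: the direction that feature writes.
    """
    dim = len(w_dec[0])
    return [sum(f[i] * w_dec[i][d] for i in range(len(f))) for d in range(dim)]

# Hypothetical toy dictionary: 3 features over a 2-dimensional activation.
# In this toy the encoder and decoder share the same directions.
directions = [
    [1.0, 0.0],   # feature 0 (imagine: "bridge")
    [0.0, 1.0],   # feature 1 (imagine: "San Francisco")
    [-1.0, 0.0],  # feature 2 (imagine: some unrelated concept)
]
b_enc = [-0.5, -0.5, -0.5]  # negative biases act as thresholds and encourage sparsity

activation = [2.0, 0.1]
features = encode(activation, directions, b_enc)
reconstruction = decode(features, directions)

# "Manipulating" a concept: clamp feature 1 to a large value, then decode.
steered_features = list(features)
steered_features[1] = 3.0
steered = decode(steered_features, directions)

print("features:      ", features)
print("reconstruction:", reconstruction)
print("steered:       ", steered)
```

Only one of the three features fires for this input, which is the sparsity that makes individual features easier to label. Clamping a feature and decoding is one simple way to picture an intervention; it is an assumption of this sketch, not a description of how any particular lab implements steering.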
A useful mental model for an LLM is a giant matrix where each row is a possible prompt and columns represent next-token probabilities. This matrix is impossibly large but also extremely sparse, as most token combinations are gibberish. The LLM's job is to efficiently compress and approximate this matrix.
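The matrix picture can be made concrete with a tiny lookup table. The sketch below assumes whitespace tokenization and a fixed two-token context, neither of which comes from the source, and stores only the rows that actually occur in a toy corpus. A real LLM never materializes such a table; the point is only to show how few of the possible rows are populated and what an unseen row looks like.

```python
from collections import Counter, defaultdict

def build_next_token_table(corpus, context_len=2):
    """Map each observed context (a row) to next-token counts (its columns)."""
    table = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(context_len, len(tokens)):
            context = tuple(tokens[i - context_len:i])
            table[context][tokens[i]] += 1
    return table

def next_token_probs(table, context):
    """Normalize one row of the table into a probability distribution."""
    counts = table.get(tuple(context))
    if not counts:
        # Unseen row: a lookup table has nothing to say here,
        # whereas a model has to generalize from similar rows.
        return {}
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the mat",
]
table = build_next_token_table(corpus)
vocab = {tok for s in corpus for tok in s.split()}
possible_rows = len(vocab) ** 2  # every two-token context over the vocabulary

print(len(table), "of", possible_rows, "possible rows are populated")
print(next_token_probs(table, ["on", "the"]))
print(next_token_probs(table, ["mat", "the"]))
```

Even with a seven-word vocabulary most two-token contexts never appear, and the number of possible rows grows exponentially with context length. That is why an explicit table is hopeless at scale and a compressed approximation is needed.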
The complexity in LLMs isn't intelligence emerging in silicon; it reflects our own. These models are deep because they encode the vast, causally powerful structure of human language and culture. We are looking at a high-resolution imprint of our own collective mind.
Human understanding is the ability to connect new information to a global, unified model of the universe. Until recently, AI models were isolated (e.g., a chess model). The major advance with large multimodal models is their ability to create a single, cohesive reality model, enabling true, generalizable understanding.
Analysis of models' hidden 'chain of thought' reveals the emergence of a unique internal dialect. This language is compressed, uses non-standard grammar, and contains bizarre phrases that are already difficult for humans to interpret, complicating safety monitoring and raising concerns about future incomprehensibility.
The field is moving beyond labeling concepts with sparse autoencoders. The new frontier is understanding the intricate geometric structures (manifolds) these concepts form in a model's latent space and how circuits transform them, providing a more unified, dynamic view.
Just as biology deciphers the complex systems created by evolution, mechanistic interpretability seeks to understand the "how" inside neural networks. Instead of treating models as black boxes, it examines their internal parameters and activations to reverse-engineer how they work, moving beyond just measuring their external behavior.
Contrary to fears, interpretability techniques for Transformers seem to work well on new architectures like Mamba and Mixture-of-Experts. These architectures may even offer novel "affordances," such as interpretable routing paths in MoEs, that could make understanding models easier, not harder.
Language models work by identifying subtle, implicit patterns in human language that even linguists cannot fully articulate. Their success broadens our definition of "knowledge" to include systems that can embody and use information without the explicit, symbolic understanding that humans traditionally require.
Goodfire AI found that for certain tasks, simple classifiers trained on a model's raw activations performed better than those using features from Sparse Autoencoders (SAEs). This surprising result challenges the assumption that SAEs always provide a cleaner concept space.
Using a sparse autoencoder to identify active concepts, one can project a model's gradient update onto these concepts. This reveals what the model is learning (e.g., "pirate speak" vs. "arithmetic") and allows for selectively amplifying or suppressing specific learning directions.
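A minimal sketch of the projection step, under several assumptions not stated in the source: the update is represented as a vector in the same space as the sparse autoencoder's decoder directions, each concept is a single direction, and the two directions and their labels below are hypothetical and hand-set. The code reads off how much of the update points along each concept, then rescales one component to suppress or amplify it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def project_onto_concepts(grad, concept_dirs):
    """Coefficient of the update along each unit-norm concept direction."""
    return {name: dot(grad, normalize(d)) for name, d in concept_dirs.items()}

def rescale_concept(grad, direction, scale):
    """Return grad with its component along `direction` multiplied by `scale`.

    scale=0.0 removes that component, scale=2.0 doubles it.
    """
    u = normalize(direction)
    coeff = dot(grad, u)
    return [g + (scale - 1.0) * coeff * ui for g, ui in zip(grad, u)]

# Hypothetical concept directions (imagine rows of an SAE decoder).
concept_dirs = {
    "pirate_speak": [1.0, 0.0, 0.0],
    "arithmetic": [0.0, 1.0, 0.0],
}
grad = [0.9, 0.1, 0.3]  # toy update vector

coeffs = project_onto_concepts(grad, concept_dirs)
suppressed = rescale_concept(grad, concept_dirs["pirate_speak"], 0.0)
amplified = rescale_concept(grad, concept_dirs["arithmetic"], 2.0)

print("what the update is learning:", coeffs)
print("pirate speak suppressed:    ", suppressed)
print("arithmetic amplified:       ", amplified)
```

The toy directions are orthogonal, so the coefficients separate cleanly. Learned concept directions generally overlap, in which case rescaling one component also shifts the readings for its neighbors.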