Mike Krieger intentionally disregards initial hype and "toy example" tests for new AI models. He believes a model's true capabilities and value are only revealed after users have integrated it into real-world workflows for a sustained period, discovering its strengths and weaknesses through practical application.

Related Insights

The Labs team intentionally builds products that are non-functional or unsafe with current AI models to serve as future benchmarks. This "bad" product acts as a consistent testbed to measure progress and signal to the research team when a new model has finally crossed a critical capability threshold, making the product viable.

Despite access to the powerful Fable model, Mike Krieger finds it's "overkill" for simple queries like sports scores. He deliberately uses the faster, less "thoughtful" Sonnet model on his phone, highlighting the need for a "model fleet" approach for different tasks.

The initial reaction to Anthropic's Fable five model suggests its true power is only obvious to experts tackling complex problems. This makes it hard to demonstrate value to a broader user base: benefits for common tasks like strategic thinking exist, but they are subtler and harder to recognize immediately.

Many leaders test AI with simple, surface-level experiments. But modern AI is so advanced that these small tests create a false sense of understanding. According to Braze CPO Kevin Wang, genuine value is only revealed when AI is applied to complex, multi-team business problems and real-world workloads.

The debate over whether LLMs are truly "intelligent" is academic. The practical test for product builders is whether the tool produces valuable outputs that lead to better decisions, regardless of the underlying mechanism.

The essential skill for AI PMs is deep intuition, which can only be built through hands-on experimentation. This means actively using every new LLM, image, and video model upon release to objectively understand its capabilities, limitations, and trajectory, rather than relying on second-hand analysis.

Beyond standard benchmarks, Anthropic fine-tunes its models based on their "eagerness." An AI can be "too eager," over-delivering and making unwanted changes, or "too lazy," requiring constant prodding. Finding the right balance is a critical, non-obvious aspect of creating a useful and steerable AI assistant.

While Anthropic's Mythos model is a best-in-class bug-finder, its capabilities are an incremental improvement, not a paradigm shift. Cybersecurity expert Alex Stamos notes the real security Rubicon was crossed last year by multiple models. The narrative of Mythos as a uniquely dangerous AI is therefore more a result of coordinated marketing than a reflection of a singular new threat.

Alex Karp argues that an AI's high score on a single benchmark is irrelevant for enterprise adoption. Real institutions require passing thousands of consecutive, differentiated tests. An AI model that is brilliant at one task but fails at the 50th in a complex sequence is effectively useless.

A new product development principle for AI is to observe the model's "latent demand"—what it attempts to do on its own. Instead of just reacting to user hacks, Anthropic builds tools to facilitate the model's innate tendencies, inverting the traditional user-centric approach.