The Labs team intentionally builds products that are non-functional or unsafe with current AI models to serve as future benchmarks. This 'bad' product acts as a consistent testbed to measure progress and signal to the research team when a new model has finally crossed a critical capability threshold, making the product viable.
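As a rough illustration of the testbed idea, the sketch below re-runs one fixed set of cases against each new model and reports the first model to clear a viability bar. Everything here is hypothetical: the `TESTBED` cases, the `VIABILITY_THRESHOLD` value, and the stub models are assumptions made for the example, not details from the episode.

```python
from typing import Callable, Dict, List, Optional, Tuple

Model = Callable[[str], str]

# Hypothetical fixed testbed: the cases never change, so scores stay
# comparable across model generations.
TESTBED: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("reverse 'abc'", "cba"),
]

# Assumed bar at which the product would count as viable.
VIABILITY_THRESHOLD = 0.9


def pass_rate(model: Model, testbed: List[Tuple[str, str]]) -> float:
    """Fraction of testbed cases the model answers exactly right."""
    passed = sum(1 for prompt, expected in testbed if model(prompt) == expected)
    return passed / len(testbed)


def first_viable(
    models: Dict[str, Model],
    testbed: List[Tuple[str, str]],
    threshold: float,
) -> Optional[str]:
    """Return the first model (in release order) that meets the threshold."""
    for name, model in models.items():
        if pass_rate(model, testbed) >= threshold:
            return name
    return None


# Stub models standing in for successive releases (illustrative only).
def old_model(prompt: str) -> str:
    return ""  # fails every case


def new_model(prompt: str) -> str:
    return dict(TESTBED).get(prompt, "")  # answers every case correctly


MODELS: Dict[str, Model] = {"old-model": old_model, "new-model": new_model}
viable = first_viable(MODELS, TESTBED, VIABILITY_THRESHOLD)
```

Keeping the cases and the threshold frozen is the point of the design: a jump in the pass rate can then be attributed to the model rather than to a change in the test.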

Related Insights

To avoid saturated evaluations that only confirm existing capabilities, Notion's team creates difficult test suites they expect to fail 70% of the time. This "headroom" provides a clear signal to model providers about frontier needs and helps the team anticipate where the technology is heading.

When building at the frontier of AI, it's a valid strategy to ship imperfect, "vibe-coded" features. This approach assumes that rapid, near-future model improvements will clean up the rough edges, so it is better to launch now than to wait for model performance that is just around the corner.

Anthropic prototypes features like code review even when model accuracy is too low for a public launch. This allows them to identify what's missing and be ready to immediately swap in a new, more capable model to close the gap and launch ahead of competitors.

OpenAI's evals team is looking beyond current benchmarks that test self-contained, hour-long tasks. They are calling for new evaluations that measure performance on problems that would take top engineers weeks or months to solve, such as creating entire products end-to-end. This signals a major increase in the complexity and ambition expected from future AI benchmarks.

The most sophisticated benchmarks, like ARC-AGI, are not meant to be a permanent 'final exam' for AI. They are designed as moving targets that are expected to become saturated and obsolete. This forces researchers to constantly focus on the next most important unsolved problem at the AI frontier.

The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.

Anthropic's core product team was too small to explore frontier AI applications, focusing instead on incremental updates. The Labs division was created specifically to build next-generation products that could showcase the exponential growth of their AI models, ensuring the product roadmap kept pace with the technology curve.

In the rapidly advancing field of AI, building products around current model limitations is a losing strategy. The most successful AI startups anticipate the trajectory of model improvements, creating experiences that seem 80% complete today but become magical once future models unlock their full potential.

Traditional, static benchmarks for AI models go stale almost immediately. The superior approach is creating dynamic benchmarks that update constantly based on real-world usage and user preferences, which can then be turned into products themselves, like an auto-routing API.

The innovation team operates on two principles. First, they identify and close the gap between what current AI models can do and how people actually use them. Second, they imagine what models will be good at in six months and start building the products for that future state today.