Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

Despite using nearly 100 software engineers to create 'SWE-Bench Verified', the benchmark had significant flaws, like overly narrow tests that demanded specific, unstated implementation choices. These flaws only became apparent when analyzing why highly capable models were failing, showing that model advancements are necessary to debug and stress-test their own evaluations.

Related Insights

AI models show impressive performance on evaluation benchmarks but underwhelm in real-world applications. This gap exists because researchers, focused on evals, create reinforcement learning (RL) environments that mirror test tasks. This leads to narrow intelligence that doesn't generalize, a form of human-driven reward hacking.

A benchmark like SWE-Bench is valuable when models score 20%, but becomes meaningless noise once models achieve 80%+ scores. At that point, improvements reflect guessing arbitrary details (like function names) rather than genuine capability. This demonstrates that benchmarks have a natural lifecycle and must be retired once saturated to avoid misleading progress metrics.

Public leaderboards like LM Arena are becoming unreliable proxies for model performance. Teams implicitly or explicitly "benchmark" by optimizing for specific test sets. The superior strategy is to focus on internal, proprietary evaluation metrics and use public benchmarks only as a final, confirmatory check, not as a primary development target.

OpenAI's evals team is looking beyond current benchmarks that test self-contained, hour-long tasks. They are calling for new evaluations that measure performance on problems that would take top engineers weeks or months to solve, such as creating entire products end-to-end. This signals a major increase in the complexity and ambition expected from future AI benchmarks.

As benchmarks become standard, AI labs optimize models to excel at them, leading to score inflation without necessarily improving generalized intelligence. The solution isn't a single perfect test, but continuously creating new evals that measure capabilities relevant to real-world user needs.

The gap between benchmark scores and real-world performance suggests labs achieve high scores by distilling superior models or training for specific evals. This makes benchmarks a poor proxy for genuine capability, a skepticism that should be applied to all new model releases.

Seemingly simple benchmarks yield wildly different results if not run under identical conditions. Third-party evaluators must run tests themselves because labs often use optimized prompts to inflate scores. Even then, challenges like parsing inconsistent answer formats make truly fair comparison a significant technical hurdle.

Don't trust academic benchmarks. Labs often "hill climb" or game them for marketing purposes, which doesn't translate to real-world capability. Furthermore, many of these benchmarks contain incorrect answers and messy data, making them an unreliable measure of true AI advancement.

Instead of generic benchmarks, Superhuman tests its AI models against specific problem "dimensions" like deep search and date comprehension. It uses "canonical queries," including extreme edge cases from its CEO, to ensure high quality on tasks that matter most to demanding users.

Popular AI coding benchmarks can be deceptive because they prioritize task completion over efficiency. A model that uses significantly more tokens and time to reach a solution is fundamentally inferior to one that delivers an elegant result faster, even if both complete the task.

Even OpenAI's Human-Verified Benchmarks Had Flaws Only Exposed by Superhuman AI | RiffOn