Early benchmark improvements focused on adding more languages and repositories. Now, the cutting edge involves creating more difficult evaluation splits through sophisticated curation techniques. Researchers must justify why their new benchmark is qualitatively harder, not just broader, than existing ones.
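
As an illustration of difficulty-based curation (not any specific benchmark's pipeline), a harder split can be built by keeping only the items that current reference models still fail. The solver list and threshold below are hypothetical.

```python
from typing import Callable, Iterable

def curate_hard_split(
    items: Iterable[dict],
    solvers: list[Callable[[dict], bool]],  # hypothetical: each returns True if that model solves the item
    max_solve_rate: float = 0.2,
) -> list[dict]:
    """Keep only items that today's reference models rarely solve."""
    hard = []
    for item in items:
        solve_rate = sum(solver(item) for solver in solvers) / len(solvers)
        if solve_rate <= max_solve_rate:
            hard.append(item)
    return hard
```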

Related Insights

The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.
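
One way to make the eval-as-PRD framing concrete: an eval can be written as a declarative spec stating the task, the grading rule, and the bar that counts as success. This is a minimal sketch with a hypothetical grading function and pass bar, not any lab's actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalSpec:
    """An eval doubles as a requirements doc: it states what 'done' means."""
    name: str
    prompts: list[str]
    grade: Callable[[str, str], bool]  # (prompt, model_output) -> pass/fail
    pass_bar: float                    # fraction of prompts that must pass

def run_eval(spec: EvalSpec, model: Callable[[str], str]) -> bool:
    passed = sum(spec.grade(p, model(p)) for p in spec.prompts)
    return passed / len(spec.prompts) >= spec.pass_bar
```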

Traditional AI benchmarks are seen as increasingly incremental and less interesting. The new frontier for evaluating a model's true capability lies in applied, complex tasks that mimic real-world interaction, such as building in Minecraft (MC Bench) or managing a simulated business (VendingBench); these tasks are more revealing of raw intelligence than conventional benchmarks.

When models achieve suspiciously high scores, it raises questions about benchmark integrity. Deliberately including impossible problems gives a benchmark a built-in check: it tests whether an AI can recognize an unsolvable request and refuse it, a crucial skill for real-world reliability and safety.
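
A minimal sketch of how such canary items could be scored, assuming each item carries a solvability flag: credit on an unsolvable item is given only for an explicit refusal. The refusal check here is a naive string match; a real grader would use a rubric or a judge model.

```python
def is_refusal(output: str) -> bool:
    # Naive placeholder; real graders use a rubric or a judge model.
    return any(phrase in output.lower()
               for phrase in ("cannot be solved", "not solvable", "i can't", "no valid answer"))

def score_item(item: dict, output: str) -> float:
    """Reward refusals on impossible items; grade normally otherwise."""
    if not item["solvable"]:
        return 1.0 if is_refusal(output) else 0.0
    return 1.0 if output.strip() == item["answer"] else 0.0
```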

As reinforcement learning (RL) techniques mature, the core challenge shifts from the algorithm to the problem definition. The competitive moat for AI companies will be their ability to create high-fidelity environments and benchmarks that accurately represent complex, real-world tasks, effectively teaching the AI what matters.
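
In practice, "environment" here usually means a gym-style reset/step interface wrapped around a real-world task. The sketch below shows only the shape of that contract, with an invented support-ticket task and a placeholder reward; the fidelity of the simulated state and reward is where the real difficulty lies.

```python
from typing import Any

class SupportTicketEnv:
    """Invented example environment: resolve a customer ticket end to end."""

    def reset(self) -> dict[str, Any]:
        # Return the initial observation: the ticket plus account context.
        return {"ticket": "Refund not received", "account": {"orders": 3}}

    def step(self, action: str) -> tuple[dict[str, Any], float, bool]:
        # Apply the agent's action, return (observation, reward, done).
        done = action.startswith("resolve")
        reward = 1.0 if done else 0.0  # placeholder reward model
        return {"ticket_state": "closed" if done else "open"}, reward, done
```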

Counterintuitively, GitHub discovered that training coding models on more private enterprise codebases (e.g., modern web frameworks) provides little benefit. The significant performance gains come from training on scarce, legacy code like COBOL, where public data is limited but enterprise demand for modernization is high.

Traditional, static benchmarks for AI models go stale almost immediately. The superior approach is to build dynamic benchmarks that update continuously based on real-world usage and user preferences; these benchmarks can then be turned into products themselves, such as an auto-routing API.
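
A dynamic benchmark of this kind can be viewed as a live win-rate table per task category, and an auto-routing API is then a lookup over that table. This is a sketch under that assumption; the categories, model names, and preference feed are hypothetical.

```python
from collections import defaultdict

class AutoRouter:
    """Route each query to the model with the best live win rate in its category."""

    def __init__(self) -> None:
        self.wins = defaultdict(lambda: defaultdict(int))
        self.trials = defaultdict(lambda: defaultdict(int))

    def record_preference(self, category: str, winner: str, loser: str) -> None:
        # Fed by real-world pairwise user preferences.
        self.wins[category][winner] += 1
        self.trials[category][winner] += 1
        self.trials[category][loser] += 1

    def route(self, category: str, default: str = "model-a") -> str:
        rates = {m: self.wins[category][m] / t
                 for m, t in self.trials[category].items() if t}
        return max(rates, key=rates.get) if rates else default
```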

An analysis of AI model performance shows a 2-2.5x improvement in intelligence scores across all major players within the last year. This rapid advancement is leading to near-perfect scores on existing benchmarks, indicating a need for new, more challenging tests to measure future progress.

Since AI assistants make it easy for candidates to complete take-home coding exercises, simply evaluating the final product is no longer an effective screening method. The new best practice is to require candidates to build with AI and then explain their thought process, revealing their true engineering and problem-solving skills.

Instead of generic benchmarks, Superhuman tests its AI models against specific problem "dimensions" like deep search and date comprehension. It uses "canonical queries," including extreme edge cases from its CEO, to ensure high quality on tasks that matter most to demanding users.
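
The dimension-plus-canonical-query setup might look roughly like the sketch below; the dimensions, queries, and judge are invented stand-ins for Superhuman's internal suite.

```python
# Hypothetical canonical queries grouped by problem dimension.
CANONICAL_QUERIES = {
    "deep_search": [
        ("find the thread where legal approved the Q3 contract", "surfaces the approval thread"),
    ],
    "date_comprehension": [
        ("emails from the Tuesday before last", "resolves to a concrete date range"),
    ],
}

def check_dimension(dimension: str, model, judge) -> float:
    """Fraction of canonical queries the model handles to the judge's satisfaction."""
    cases = CANONICAL_QUERIES[dimension]
    passed = sum(judge(query, model(query), expectation) for query, expectation in cases)
    return passed / len(cases)
```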

Current benchmarks like SWE-bench test isolated, independent tasks. The new Code Clash benchmark aims to evaluate long-horizon development by having AI models compete in a tournament, continuously improving their own codebases in response to competitive pressure from other models.
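
Code Clash's exact rules aren't described here, but the general shape of a long-horizon tournament eval looks something like the loop below: each round, every model revises its own codebase, then plays head-to-head matches whose outcomes feed the next round. All function names are illustrative.

```python
import itertools

def run_tournament(models: dict, rounds: int, improve, play_match) -> dict[str, int]:
    """Round-robin tournament over evolving codebases (illustrative only).

    models: name -> current codebase
    improve(name, codebase, history) -> revised codebase
    play_match(codebase_a, codebase_b) -> "a" or "b", the winner
    """
    scores = {name: 0 for name in models}
    history: list[tuple[str, str, str]] = []
    for _ in range(rounds):
        # Each model iterates on its own code, seeing past results.
        for name in models:
            models[name] = improve(name, models[name], history)
        # Head-to-head matches create the competitive pressure.
        for a, b in itertools.combinations(models, 2):
            winner = a if play_match(models[a], models[b]) == "a" else b
            scores[winner] += 1
            history.append((a, b, winner))
    return scores
```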