While AI labs tout performance on standardized tests like math olympiads, these metrics often don't correlate with real-world usefulness or qualitative user experience. Users may prefer a model like Anthropic's Claude for its conversational style, a factor not measured by benchmarks.

Related Insights

Once AI coding agents reach a high performance level, objective benchmarks become less important than a developer's subjective experience. Like a warrior choosing a sword, the best tool is often the one that has the right "feel," writes code in a preferred style, and integrates seamlessly into a human workflow.

The proliferation of AI leaderboards incentivizes companies to optimize models for specific benchmarks. This creates a risk of "acing the SATs" where models excel on tests but don't necessarily make progress on solving real-world problems. This focus on gaming metrics could diverge from creating genuine user value.

AI excels where success is quantifiable (e.g., code generation). Its greatest challenge lies in subjective domains like mental health or education. Progress requires a messy, societal conversation to define 'success,' not just a developer-built technical leaderboard.

AI models show impressive performance on evaluation benchmarks but underwhelm in real-world applications. This gap exists because researchers, focused on evals, create reinforcement learning (RL) environments that mirror test tasks. This leads to narrow intelligence that doesn't generalize, a form of human-driven reward hacking.

Don't trust academic benchmarks. Labs often "hill climb" or game them for marketing purposes, which doesn't translate to real-world capability. Furthermore, many of these benchmarks contain incorrect answers and messy data, making them an unreliable measure of true AI advancement.

As models mature, their core differentiator will become their underlying personality and values, shaped by their creators' objective functions. One model might optimize for user productivity by being concise, while another optimizes for engagement by being verbose.

The best AI models are trained on data that reflects deep, subjective qualities—not just simple criteria. This "taste" is a key differentiator, influencing everything from code generation to creative writing, and is shaped by the values of the frontier lab.

Instead of generic benchmarks, Superhuman tests its AI models against specific problem "dimensions" like deep search and date comprehension. It uses "canonical queries," including extreme edge cases from its CEO, to ensure high quality on tasks that matter most to demanding users.

Standardized AI benchmarks are saturated and becoming less relevant for real-world use cases. The true measure of a model's improvement is now found in custom, internal evaluations (evals) created by application-layer companies. Progress for a legal AI tool, for example, is a more meaningful indicator than a generic test score.

For subjective outputs like image aesthetics and face consistency, quantitative metrics are misleading. Google's team relies heavily on disciplined human evaluations, internal 'eyeballing,' and community testing to capture the subtle, emotional impact that benchmarks can't quantify.