METR's Time Horizon Evals Exclude "Messy" Real-World Tasks Lacking Automatic Gradability

Related Insights

Evaluating AI Success in Subjective Fields Is the Technology's Hardest Unsolved Problem

AI excels where success is quantifiable (e.g., code generation). Its greatest challenge lies in subjective domains like mental health or education. Progress requires a messy, societal conversation to define 'success,' not just a developer-built technical leaderboard.

AI: The new frontier for mental health support?

Masters of Scale·8 months ago

AI Models Excel on Benchmarks But Fail in Reality Due to 'Teaching to the Test'

AI models show impressive performance on evaluation benchmarks but underwhelm in real-world applications. This gap exists because researchers, focused on evals, create reinforcement learning (RL) environments that mirror test tasks. This leads to narrow intelligence that doesn't generalize, a form of human-driven reward hacking.

Dwarkesh and Ilya Sutskever on What Comes After Scaling

The a16z Show·7 months ago

AI Models Ace Benchmarks But Fail at Simple Real-World Tasks

There's a significant gap between AI performance in simulated benchmarks and in the real world. Despite scoring highly on evaluations, AIs in real deployments make "silly mistakes that no human would ever dream of doing," suggesting that current benchmarks don't capture the messiness and unpredictability of reality.

Can Grok and Claude run a business? We just did it

AI Pod by Wes Roth and Dylan Curious | Artificial Intelligence News and Interviews With Experts·6 months ago

A Task's 'Messiness' Predicts AI Failure Independently of Human Completion Time

Human time to completion is a strong predictor of AI success, but it's not perfect. METR's analysis found that a task's qualitative 'messiness'—how clean and simple it is versus tricky and rough—also independently predicts whether an AI will succeed. This suggests that pure task length doesn't capture all aspects of difficulty for AIs.

47 - David Rein on METR Time Horizons

AXRP - the AI X-risk Research Podcast·6 months ago

AI Benchmarks Overstate Real-World Gains; Developers Were Slowed by AI in an RCT

There's a significant gap between AI performance on structured benchmarks and its real-world utility. A randomized controlled trial (RCT) found that open-source software developers were actually slowed down by 20% when using AI assistants, despite being miscalibrated to believe the tools were helping. This highlights the limitations of current evaluation methods.

47 - David Rein on METR Time Horizons

AXRP - the AI X-risk Research Podcast·6 months ago

METR Measures AI Agency By Timing Expert Humans, Factoring Out Prior Knowledge

To isolate for agency rather than just knowledge, METR's 'time horizon' metric measures how long tasks take for human experts who already possess the required background knowledge. This methodology aims to reconcile why models can be 'geniuses' on knowledge-intensive tasks (like IMO problems) but 'idiots' on simple, multi-step actions.

47 - David Rein on METR Time Horizons

AXRP - the AI X-risk Research Podcast·6 months ago

Creating Benchmarks Is the True Bottleneck to Complex AI Capabilities

AI struggles with long-horizon tasks not just due to technical limits, but because we lack good ways to measure performance. Once effective evaluations (evals) for these capabilities exist, researchers can rapidly optimize models against them, accelerating progress significantly.

Brendan Foody on Teaching AI and the Future of Knowledge Work

Conversations with Tyler·6 months ago

Evaluating AI on Benchmarks Alone Is as Flawed as Judging Students by Standardized Tests

Just as standardized tests fail to capture a student's full potential, AI benchmarks often don't reflect real-world performance. The true value comes from the 'last mile' ingenuity of productization and workflow integration, not just raw model scores, which can be misleading.

DreamWorks & the Science of Storytelling | Jeffrey Katzenberg & ChenLi Wang, WndrCo

Sourcery·6 months ago

Meaningful AI Benchmarks Are Evolving From Abstract Scores to Practical Task Completion

Traditional AI benchmarks are seen as increasingly incremental and less interesting. The new frontier for evaluating a model's true capability lies in applied, complex tasks that mimic real-world interaction, such as building in Minecraft (MC Bench) or managing a simulated business (VendingBench), which are more revealing of raw intelligence.

Google Gemini 3 reactions, Google Antigravity, Anthropic-Nvidia-Microsoft Deal | Diet TBPN

TBPN·8 months ago

Measuring AI on Multi-Week Human Tasks Is Becoming Prohibitively Expensive

A major challenge for the 'time horizon' metric is its cost. As AI capabilities improve, the tasks needed to benchmark them grow from hours to weeks or months. The cost of paying human experts for these long durations to establish a baseline becomes extremely high, threatening the long-term viability of this evaluation method.

47 - David Rein on METR Time Horizons

AXRP - the AI X-risk Research Podcast·6 months ago

Get your free personalized podcast brief

Related Insights