Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

The tasks in METR's Time Horizon chart are not representative of all AI work. They are selected for being automatically gradable and neatly scoped, deliberately excluding "messy," open-ended, or vision-dependent tasks common in the real world. This selection bias is a key limitation when interpreting the chart's predictions.

Related Insights

AI excels where success is quantifiable (e.g., code generation). Its greatest challenge lies in subjective domains like mental health or education. Progress requires a messy, societal conversation to define 'success,' not just a developer-built technical leaderboard.

AI models show impressive performance on evaluation benchmarks but underwhelm in real-world applications. This gap exists because researchers, focused on evals, create reinforcement learning (RL) environments that mirror test tasks. This leads to narrow intelligence that doesn't generalize, a form of human-driven reward hacking.

There's a significant gap between AI performance in simulated benchmarks and in the real world. Despite scoring highly on evaluations, AIs in real deployments make "silly mistakes that no human would ever dream of doing," suggesting that current benchmarks don't capture the messiness and unpredictability of reality.

Human time to completion is a strong predictor of AI success, but it's not perfect. METR's analysis found that a task's qualitative 'messiness'—how clean and simple it is versus tricky and rough—also independently predicts whether an AI will succeed. This suggests that pure task length doesn't capture all aspects of difficulty for AIs.

There's a significant gap between AI performance on structured benchmarks and its real-world utility. A randomized controlled trial (RCT) found that open-source software developers were actually slowed down by 20% when using AI assistants, despite being miscalibrated to believe the tools were helping. This highlights the limitations of current evaluation methods.

To isolate for agency rather than just knowledge, METR's 'time horizon' metric measures how long tasks take for human experts who already possess the required background knowledge. This methodology aims to reconcile why models can be 'geniuses' on knowledge-intensive tasks (like IMO problems) but 'idiots' on simple, multi-step actions.

AI struggles with long-horizon tasks not just due to technical limits, but because we lack good ways to measure performance. Once effective evaluations (evals) for these capabilities exist, researchers can rapidly optimize models against them, accelerating progress significantly.

Just as standardized tests fail to capture a student's full potential, AI benchmarks often don't reflect real-world performance. The true value comes from the 'last mile' ingenuity of productization and workflow integration, not just raw model scores, which can be misleading.

Traditional AI benchmarks are seen as increasingly incremental and less interesting. The new frontier for evaluating a model's true capability lies in applied, complex tasks that mimic real-world interaction, such as building in Minecraft (MC Bench) or managing a simulated business (VendingBench), which are more revealing of raw intelligence.

A major challenge for the 'time horizon' metric is its cost. As AI capabilities improve, the tasks needed to benchmark them grow from hours to weeks or months. The cost of paying human experts for these long durations to establish a baseline becomes extremely high, threatening the long-term viability of this evaluation method.

METR's Time Horizon Evals Exclude "Messy" Real-World Tasks Lacking Automatic Gradability | RiffOn