METR, an independent research group, combines the two disciplines in its name: Model Evaluation (ME), to understand AI capabilities and propensities, and Threat Research (TR), to connect those findings to specific threat models. This dual approach lets it assess whether AI poses catastrophic risks to society.
The widely-cited Time Horizon chart, which plots AI capabilities over time, began as a scattered, conceptual graph in an internal METR presentation. The team was surprised to discover a remarkably straight, predictable trendline when they plotted actual data, making its regularity an unexpected and powerful finding.
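The "straight trendline" claim amounts to saying the time horizon grows exponentially, so it looks linear on a log scale. As a minimal sketch (the data points below are hypothetical placeholders, not METR's actual measurements), a log-linear least-squares fit recovers the doubling time:

```python
import math

# Hypothetical (years since some reference date, task horizon in minutes).
# These are illustrative placeholders, NOT METR's published data.
data = [(0.0, 0.1), (1.5, 0.8), (3.0, 5.0), (4.5, 30.0), (6.0, 120.0)]

# Ordinary least squares on log(horizon): a straight line here is
# exactly the "remarkably regular" trend described above.
n = len(data)
xs = [x for x, _ in data]
ys = [math.log(y) for _, y in data]
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

doubling_time_years = math.log(2) / slope
print(f"doubling time ~ {doubling_time_years:.2f} years")
```

The point of the sketch is only the method: once horizons are plotted on a log axis, regularity shows up as a straight line whose slope gives a doubling time.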
A slowdown in compute growth may have a squared negative effect on AI progress. It not only reduces resources for training larger models but also stifles the discovery of new algorithms, as breakthroughs like the Transformer required immense compute for experimentation. This double impact could significantly delay major capabilities milestones.
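The "squared" intuition can be made concrete with a toy model (my illustration, not METR's): if progress is roughly the product of training-run scale and algorithmic efficiency, and both are bought with compute, a slowdown factor hits progress twice.

```python
# Toy model, hypothetical by construction: effective progress is taken to be
# proportional to (training compute) x (algorithmic efficiency), where
# algorithmic gains are themselves found via compute-hungry experiments.

def effective_progress(compute: float, s: float) -> float:
    """s = fraction of baseline compute that survives a slowdown (0 < s <= 1)."""
    training = compute * s       # smaller training runs
    algorithms = compute * s     # fewer experiments -> fewer breakthroughs
    return training * algorithms # multiplicative, so roughly s**2 overall

baseline = effective_progress(1.0, 1.0)
slowed = effective_progress(1.0, 0.5)
print(slowed / baseline)  # 0.25: halving compute growth -> ~4x less progress
```

Under this (deliberately crude) assumption, a 2x compute slowdown yields a 4x reduction in effective progress, which is the "double impact" described above.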
METR's influential study on AI developer productivity is now difficult to replicate: as AI tools become more powerful, developers are unwilling to be randomized into a control group where AI use is forbidden. This self-selection makes it increasingly impractical to measure true productivity gains with the original randomized design.
Joel Becker became the most profitable trader on Manifold Markets not through superior forecasting, but by practicing "high-agency trading." He bet on a market predicting charity donation totals and then personally made donations to ensure the outcome he bet on would occur, demonstrating how prediction markets can be manipulated by participants' actions.
The tasks in METR's Time Horizon chart are not representative of all AI work. They are selected for being automatically gradable and neatly scoped, deliberately excluding "messy," open-ended, or vision-dependent tasks common in the real world. This selection bias is a key limitation when interpreting the chart's predictions.
Beyond quantitative benchmarks, METR's assessment of AI's catastrophic risk relies heavily on qualitative evidence. This includes watching model transcripts for "derpy" mistakes, observing their inability to use resources well, and relying on the intuition that a new model is only incrementally more capable than the previously non-dangerous one.
Simply passing unit tests (like in SWE-bench) is a weak signal of a coding AI's usefulness. A far better evaluation is whether a senior engineer would actually merge its solution into the main codebase. This holistic judgment accounts for code patterns, test quality, and architectural consistency, which current benchmarks miss.
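One way to picture the "would a senior engineer merge this?" standard is as a rubric in which passing tests is necessary but not sufficient. This is a hypothetical rubric of my own for illustration, not SWE-bench's grader or METR's actual evaluation:

```python
from dataclasses import dataclass

@dataclass
class PatchReview:
    tests_pass: bool                  # the SWE-bench-style signal
    follows_codebase_patterns: bool   # matches surrounding conventions
    adds_meaningful_tests: bool       # test quality, not just test count
    architecturally_consistent: bool  # fits the existing design

def would_merge(r: PatchReview) -> bool:
    # Passing unit tests is only one of several gates.
    return (r.tests_pass
            and r.follows_codebase_patterns
            and r.adds_meaningful_tests
            and r.architecturally_consistent)

# A patch that passes tests but ignores codebase conventions is rejected.
print(would_merge(PatchReview(True, False, True, True)))  # False
```

The design point is the conjunction: benchmarks that report only `tests_pass` collapse the other three gates to "always true."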
Developers claiming 10x speedups from AI often aren't 10x faster on their core tasks. Instead, they're tackling new side projects that were previously impossible, creating a perception of "infinite" speedup. But these new tasks are often less economically valuable, so the perceived speedup overstates the true productivity gain on business-critical work.
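The gap between perceived and value-weighted speedup is just arithmetic. With hypothetical numbers of my own choosing (not figures from METR's study):

```python
# Hypothetical illustration: a developer ships 10 units/week of
# business-critical work without AI. With AI they ship 12 core units
# plus 20 units of new side projects worth only 0.1 each.

core_before = 10.0
core_after = 12.0
side_units = 20.0
side_value_per_unit = 0.1

perceived = (core_after + side_units) / core_before          # counts raw output
true_gain = (core_after + side_units * side_value_per_unit) / core_before

print(f"perceived speedup: {perceived:.1f}x")   # 3.2x
print(f"value-weighted:    {true_gain:.1f}x")   # 1.4x
```

Raw task counts suggest a 3.2x speedup; weighting by business value, the gain on work that matters is closer to 1.4x.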
A genuine AI capabilities explosion won't happen just because models can write novel research papers. The bottleneck is the full automation of the R&D loop, which includes a long tail of "messy" real-world tasks like fixing failing GPUs in a data center or managing facility cooling. This physical and logistical grounding is often overlooked.
Standard benchmarks are too rigid. The future of model evaluation needs more open-ended, multi-agent scenarios like the "AI Village" project. Giving agents broad goals like "organize an event" reveals more about their "derpy" failure modes and real-world capabilities than constrained, benchmark-style tasks can capture.
