METR, an independent research group, combines the two disciplines in its name: Model Evaluation (ME), to understand AI capabilities and propensities, and Threat Research (TR), to connect those findings to specific threat models. This dual approach lets it assess whether AI poses catastrophic risks to society.
The widely-cited Time Horizon chart, which plots AI capabilities over time, began as a scattered, conceptual graph in an internal METR presentation. The team was surprised to discover a remarkably straight, predictable trendline when they plotted actual data, making its regularity an unexpected and powerful finding.
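The "straight trendline" claim amounts to saying the time horizon grows exponentially, so it looks linear on a log scale. As a minimal sketch (the data points below are hypothetical placeholders, not METR's actual measurements), a log-linear least-squares fit recovers the doubling time:

```python
import math

# Hypothetical (years since some reference date, task horizon in minutes).
# These are illustrative placeholders, NOT METR's published data.
data = [(0.0, 0.1), (1.5, 0.8), (3.0, 5.0), (4.5, 30.0), (6.0, 120.0)]

# Ordinary least squares on log(horizon): a straight line here is
# exactly the "remarkably regular" trend described above.
n = len(data)
xs = [x for x, _ in data]
ys = [math.log(y) for _, y in data]
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

doubling_time_years = math.log(2) / slope
print(f"doubling time ~ {doubling_time_years:.2f} years")
```

The point of the sketch is only the method: once horizons are plotted on a log axis, regularity shows up as a straight line whose slope gives a doubling time.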
A slowdown in compute growth may have a squared negative effect on AI progress. It not only reduces resources for training larger models but also stifles the discovery of new algorithms, as breakthroughs like the Transformer required immense compute for experimentation. This double impact could significantly delay major capabilities milestones.
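The "squared" intuition can be made concrete with a toy model (my illustration, not METR's): if progress is roughly the product of training-run scale and algorithmic efficiency, and both are bought with compute, a slowdown factor hits progress twice.

```python
# Toy model, hypothetical by construction: effective progress is taken to be
# proportional to (training compute) x (algorithmic efficiency), where
# algorithmic gains are themselves found via compute-hungry experiments.

def effective_progress(compute: float, s: float) -> float:
    """s = fraction of baseline compute that survives a slowdown (0 < s <= 1)."""
    training = compute * s       # smaller training runs
    algorithms = compute * s     # fewer experiments -> fewer breakthroughs
    return training * algorithms # multiplicative, so roughly s**2 overall

baseline = effective_progress(1.0, 1.0)
slowed = effective_progress(1.0, 0.5)
print(slowed / baseline)  # 0.25: halving compute growth -> ~4x less progress
```

Under this (deliberately crude) assumption, a 2x compute slowdown yields a 4x reduction in effective progress, which is the "double impact" described above.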
METR's influential study on AI developer productivity is now difficult to replicate: as AI tools become more powerful, developers are unwilling to be randomized into a control group where AI use is forbidden. This self-selection makes it increasingly impractical to measure true productivity gains with the original randomized design.
Joel Becker became the most profitable trader on Manifold Markets not through superior forecasting, but by practicing "high-agency trading." He bet on a market predicting charity donation totals and then personally made donations to ensure the outcome he bet on would occur, demonstrating how prediction markets can be manipulated by participants' actions.
The tasks in METR's Time Horizon chart are not representative of all AI work. They are selected for being automatically gradable and neatly scoped, deliberately excluding "messy," open-ended, or vision-dependent tasks common in the real world. This selection bias is a key limitation when interpreting the chart's predictions.
Beyond quantitative benchmarks, METR's assessment of AI's catastrophic risk relies heavily on qualitative evidence. This includes watching model transcripts for "derpy" mistakes, observing their inability to use resources well, and relying on the intuition that a new model is only incrementally more capable than the previously non-dangerous one.
Simply passing unit tests (like in SWE-bench) is a weak signal of a coding AI's usefulness. A far better evaluation is whether a senior engineer would actually merge its solution into the main codebase. This holistic judgment accounts for code patterns, test quality, and architectural consistency, which current benchmarks miss.
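One way to picture the "would a senior engineer merge this?" standard is as a rubric in which passing tests is necessary but not sufficient. This is a hypothetical rubric of my own for illustration, not SWE-bench's grader or METR's actual evaluation:

```python
from dataclasses import dataclass

@dataclass
class PatchReview:
    tests_pass: bool                  # the SWE-bench-style signal
    follows_codebase_patterns: bool   # matches surrounding conventions
    adds_meaningful_tests: bool       # test quality, not just test count
    architecturally_consistent: bool  # fits the existing design

def would_merge(r: PatchReview) -> bool:
    # Passing unit tests is only one of several gates.
    return (r.tests_pass
            and r.follows_codebase_patterns
            and r.adds_meaningful_tests
            and r.architecturally_consistent)

# A patch that passes tests but ignores codebase conventions is rejected.
print(would_merge(PatchReview(True, False, True, True)))  # False
```

The design point is the conjunction: benchmarks that report only `tests_pass` collapse the other three gates to "always true."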
Developers claiming 10x speedups from AI often aren't 10x faster on their core tasks. Instead, they're tackling new side projects that were previously impossible, creating a perception of "infinite" speedup. But these new tasks are often less economically valuable, so the perceived speedup overstates the true productivity gain on business-critical work.
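The gap between perceived and value-weighted speedup is just arithmetic. With hypothetical numbers of my own choosing (not figures from METR's study):

```python
# Hypothetical illustration: a developer ships 10 units/week of
# business-critical work without AI. With AI they ship 12 core units
# plus 20 units of new side projects worth only 0.1 each.

core_before = 10.0
core_after = 12.0
side_units = 20.0
side_value_per_unit = 0.1

perceived = (core_after + side_units) / core_before          # counts raw output
true_gain = (core_after + side_units * side_value_per_unit) / core_before

print(f"perceived speedup: {perceived:.1f}x")   # 3.2x
print(f"value-weighted:    {true_gain:.1f}x")   # 1.4x
```

Raw task counts suggest a 3.2x speedup; weighting by business value, the gain on work that matters is closer to 1.4x.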
A genuine AI capabilities explosion won't happen just because models can write novel research papers. The bottleneck is the full automation of the R&D loop, which includes a long tail of "messy" real-world tasks like fixing failing GPUs in a data center or managing facility cooling. This physical and logistical grounding is often overlooked.
Standard benchmarks are too rigid. The future of model evaluation needs more open-ended, multi-agent scenarios like the "AI Village" project. Giving agents broad goals like "organize an event" reveals more about their "derpy" failure modes and real-world capabilities than constrained, benchmark-style tasks can capture.
