Artificial intelligence is advancing at a pace that challenges how we measure its true capabilities. A recent preprint by the Model Evaluation & Threat Research (METR) group proposes a novel way to evaluate AI systems: instead of relying on abstract benchmarks, it measures the length of human-equivalent tasks that AI models can reliably complete.
The study, titled Measuring AI Ability to Complete Long Tasks (Kwa et al., 2025), introduces the concept of a “time horizon of task completion”—a metric that links AI performance to human work duration. This approach provides an intuitive and practical framework for policymakers, companies, and researchers to assess what kinds of real-world tasks are within reach of automation today, and which still demand human oversight.
Traditional benchmarks—such as multiple-choice accuracy or coding tests—often fail to capture how AI performs across tasks of varying complexity and duration. The METR team highlights that while models may excel at short, isolated tasks, their performance often declines when faced with multi-step projects requiring consistency over hours or days.
Using human time-to-completion as a reference, researchers can translate model performance into a language that resonates with industry and governance: hours, days, or weeks of human labor. For instance, if a model can reliably complete tasks equivalent to 50 minutes of human work, that figure gives a clear threshold for deciding which activities may be safely automated.
To test this framework, researchers designed a suite of 170 tasks covering software engineering and research activities. These included existing benchmarks such as HCAST and RE-Bench, as well as a newly introduced SWAA (Software Atomic Actions) dataset focusing on short, atomic tasks.
Key steps included establishing human baseline completion times for each task, running AI agents on the same tasks and scoring their success, and relating agent success rates to how long the tasks take skilled humans. The critical metric, known as the "50% time horizon," is the length of task (measured in human completion time) that a model can complete with a 50% success rate.
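To make the metric concrete, the sketch below shows one way such a horizon can be estimated: regress agent success against the log of human completion time and read off the task length at which the predicted probability of success crosses 0.5. The task lengths and outcomes here are hypothetical, and the paper's own estimation is fit over its full 170-task suite, so this is only an illustration of the idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: human completion time (minutes) per task, and whether
# a given AI agent completed that task successfully (1) or not (0).
task_minutes = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Regress success on log2(task length); the log scale reflects that task
# lengths in such a suite span seconds to hours.
X = np.log2(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, agent_success)

# The 50% time horizon is the task length where the predicted success
# probability is exactly 0.5, i.e. where the linear predictor equals zero.
log2_h50 = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated 50% time horizon: {2 ** log2_h50:.1f} minutes of human work")
```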
Robustness checks included validation on external datasets like SWE-bench Verified, sensitivity analyses using hierarchical bootstrapping, and assessments of “messiness” in task design (how real-world complexity impacts performance).
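The hierarchical bootstrap behind those sensitivity analyses can be pictured as two nested levels of resampling: first over task families, then over the tasks within each sampled family, re-estimating the horizon each time. The records below and the estimate_horizon placeholder are hypothetical; the study's own analysis re-fits its full model on every resample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical records: (task_family, human_minutes, agent_success).
records = [
    ("family_a", 2, 1), ("family_a", 8, 1), ("family_a", 30, 0),
    ("family_b", 1, 1), ("family_b", 15, 1), ("family_b", 120, 0),
    ("family_c", 4, 1), ("family_c", 60, 0), ("family_c", 240, 0),
]

def estimate_horizon(sample):
    """Placeholder estimator: geometric mean length of solved tasks.
    A full analysis would instead refit the logistic model on the resample."""
    solved = [minutes for _, minutes, ok in sample if ok]
    return float(np.exp(np.mean(np.log(solved)))) if solved else float("nan")

families = sorted({fam for fam, _, _ in records})
horizons = []
for _ in range(1000):
    # Level 1: resample whole task families with replacement.
    picked = rng.choice(families, size=len(families), replace=True)
    sample = []
    for fam in picked:
        fam_tasks = [r for r in records if r[0] == fam]
        # Level 2: resample tasks within each chosen family.
        idx = rng.integers(0, len(fam_tasks), size=len(fam_tasks))
        sample.extend(fam_tasks[i] for i in idx)
    horizons.append(estimate_horizon(sample))

low, high = np.nanpercentile(horizons, [2.5, 97.5])
print(f"95% bootstrap interval for the 50% horizon: {low:.1f}-{high:.1f} minutes")
```

Resampling at the family level first prevents many near-duplicate tasks from making the interval look narrower than it really is.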
The study’s most striking finding is the exponential growth of AI task horizons. Since 2019, the length of tasks AI can complete with 50% reliability has doubled roughly every seven months.
Early systems such as GPT-2 could only manage tasks taking humans a few seconds, GPT-4-era models reached tasks of a few minutes, and current frontier models such as Claude 3.7 Sonnet approach a 50% time horizon of roughly 50 minutes of human work.
Extrapolations suggest that if this trend continues, frontier models could handle month-long projects between 2028 and 2031. Importantly, even when accounting for large systematic errors (e.g., misestimating human times by a factor of 10), the forecasts only shift by about two years, underscoring the robustness of the approach.
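The arithmetic behind that extrapolation is simple enough to sketch directly. The seven-month doubling time and the roughly 50-minute current horizon are the study's headline figures; treating a working month as 167 hours and assuming the growth rate stays constant are simplifying assumptions made here for illustration.

```python
import math

doubling_months = 7           # observed doubling time of the 50% horizon
current_horizon_min = 50      # roughly the current frontier horizon, in minutes
month_of_work_min = 167 * 60  # assume a "month-long" project is ~167 working hours

# Doublings needed to go from the current horizon to a month of human work.
doublings = math.log2(month_of_work_min / current_horizon_min)
print(f"{doublings:.1f} doublings, about {doublings * doubling_months / 12:.1f} years away")

# A systematic factor-of-10 error in human time estimates adds or removes only
# log2(10) doublings, i.e. roughly two years at a seven-month doubling time.
print(f"Shift from a 10x error: about {math.log2(10) * doubling_months / 12:.1f} years")
```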
The practical consequences of this metric are profound: it gives companies a concrete threshold for deciding which workflows can be delegated to AI agents, offers policymakers a common yardstick for anticipating labor-market and safety impacts, and lets researchers track how quickly longer stretches of software and research work are coming within reach of automation.
The authors caution, however, against overreliance on extrapolation. They recommend expanding benchmarks to capture more realistic, context-rich tasks before using projections for prescriptive decision-making.
The METR study reframes how we measure AI progress, shifting from abstract test scores to real-world task horizons. By grounding model capabilities in human-equivalent work durations, it provides a tool for researchers, companies, and policymakers to better anticipate the impacts of rapid AI advancement.
As AI systems approach horizons measured in days or weeks, the need for thoughtful governance and workforce adaptation becomes increasingly urgent. The key takeaway: AI is not just getting “smarter”—it is learning to sustain performance over longer stretches of human-equivalent labor.
References:
Kwa T, West B, Becker J, Deng A, Garcia K, Hasin M, Jawhar S, Kinniment M, Rush N, Von Arx S, Bloom R, Broadley T, Du H, Goodrich B, Jurkovic N, Miles LH, Nix S, Lin T, Parikh N, Rein D, Sato LJK, Wijk H, Ziegler DM, Barnes E, Chan L. Measuring AI ability to complete long tasks [Internet]. arXiv; 2025 Mar [cited 2025 Sep 6]. Available from: https://ar5iv.org/abs/2503.14499
Model Evaluation & Threat Research (METR). Measuring AI ability to complete long tasks [Internet]. METR; 2025 Mar 19 [cited 2025 Sep 6]. Available from: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/