Artificial intelligence is advancing at a pace that challenges how we measure its true capabilities. A recent preprint by the Model Evaluation & Threat Research (METR) group proposes a novel way to evaluate AI systems: instead of relying on abstract benchmarks, it measures the length of human-equivalent tasks that AI models can reliably complete.
The study, titled Measuring AI Ability to Complete Long Tasks (Kwa et al., 2025), introduces the concept of a “time horizon of task completion”—a metric that links AI performance to human work duration. This approach provides an intuitive and practical framework for policymakers, companies, and researchers to assess what kinds of real-world tasks are within reach of automation today, and which still demand human oversight.
Traditional benchmarks—such as multiple-choice accuracy or coding tests—often fail to capture how AI performs across tasks of varying complexity and duration. The METR team highlights that while models may excel at short, isolated tasks, their performance often declines when faced with multi-step projects requiring consistency over hours or days.
Using human time-to-completion as a reference, researchers can translate model performance into a language that resonates with industry and governance: hours, days, or weeks of human labor. For instance, if a model can reliably complete tasks equivalent to 50 minutes of human work, that figure gives a clear threshold for deciding which activities may be safely automated.
To test this framework, researchers designed a suite of 170 tasks covering software engineering and research activities. These included existing benchmarks such as HCAST and RE-Bench, as well as a newly introduced SWAA (Software Atomic Actions) dataset focusing on short, atomic tasks.
Key steps included establishing human baseline completion times for each task, running AI agents on the same tasks and scoring their success, and relating agent success rates to how long the tasks take skilled humans. The critical metric, known as the "50% time horizon," is the length of task (measured in human completion time) that a model can complete with a 50% success rate.
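To make the metric concrete, the sketch below shows one way such a horizon can be estimated: regress agent success against the log of human completion time and read off the task length at which the predicted probability of success crosses 0.5. The task lengths and outcomes here are hypothetical, and the paper's own estimation is fit over its full 170-task suite, so this is only an illustration of the idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: human completion time (minutes) per task, and whether
# a given AI agent completed that task successfully (1) or not (0).
task_minutes = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Regress success on log2(task length); the log scale reflects that task
# lengths in such a suite span seconds to hours.
X = np.log2(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, agent_success)

# The 50% time horizon is the task length where the predicted success
# probability is exactly 0.5, i.e. where the linear predictor equals zero.
log2_h50 = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated 50% time horizon: {2 ** log2_h50:.1f} minutes of human work")
```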
Robustness checks included validation on external datasets like SWE-bench Verified, sensitivity analyses using hierarchical bootstrapping, and assessments of “messiness” in task design (how real-world complexity impacts performance).
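The hierarchical bootstrap behind those sensitivity analyses can be pictured as two nested levels of resampling: first over task families, then over the tasks within each sampled family, re-estimating the horizon each time. The records below and the estimate_horizon placeholder are hypothetical; the study's own analysis re-fits its full model on every resample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical records: (task_family, human_minutes, agent_success).
records = [
    ("family_a", 2, 1), ("family_a", 8, 1), ("family_a", 30, 0),
    ("family_b", 1, 1), ("family_b", 15, 1), ("family_b", 120, 0),
    ("family_c", 4, 1), ("family_c", 60, 0), ("family_c", 240, 0),
]

def estimate_horizon(sample):
    """Placeholder estimator: geometric mean length of solved tasks.
    A full analysis would instead refit the logistic model on the resample."""
    solved = [minutes for _, minutes, ok in sample if ok]
    return float(np.exp(np.mean(np.log(solved)))) if solved else float("nan")

families = sorted({fam for fam, _, _ in records})
horizons = []
for _ in range(1000):
    # Level 1: resample whole task families with replacement.
    picked = rng.choice(families, size=len(families), replace=True)
    sample = []
    for fam in picked:
        fam_tasks = [r for r in records if r[0] == fam]
        # Level 2: resample tasks within each chosen family.
        idx = rng.integers(0, len(fam_tasks), size=len(fam_tasks))
        sample.extend(fam_tasks[i] for i in idx)
    horizons.append(estimate_horizon(sample))

low, high = np.nanpercentile(horizons, [2.5, 97.5])
print(f"95% bootstrap interval for the 50% horizon: {low:.1f}-{high:.1f} minutes")
```

Resampling at the family level first prevents many near-duplicate tasks from making the interval look narrower than it really is.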
The study’s most striking finding is the exponential growth of AI task horizons. Since 2019, the length of tasks AI can complete with 50% reliability has doubled roughly every seven months.
Early systems such as GPT-2 could only manage tasks taking humans a few seconds, GPT-4-era models reached tasks of a few minutes, and current frontier models such as Claude 3.7 Sonnet approach a 50% time horizon of roughly 50 minutes of human work.
Extrapolations suggest that if this trend continues, frontier models could handle month-long projects between 2028 and 2031. Importantly, even when accounting for large systematic errors (e.g., misestimating human times by a factor of 10), the forecasts only shift by about two years, underscoring the robustness of the approach.
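The arithmetic behind that extrapolation is simple enough to sketch directly. The seven-month doubling time and the roughly 50-minute current horizon are the study's headline figures; treating a working month as 167 hours and assuming the growth rate stays constant are simplifying assumptions made here for illustration.

```python
import math

doubling_months = 7           # observed doubling time of the 50% horizon
current_horizon_min = 50      # roughly the current frontier horizon, in minutes
month_of_work_min = 167 * 60  # assume a "month-long" project is ~167 working hours

# Doublings needed to go from the current horizon to a month of human work.
doublings = math.log2(month_of_work_min / current_horizon_min)
print(f"{doublings:.1f} doublings, about {doublings * doubling_months / 12:.1f} years away")

# A systematic factor-of-10 error in human time estimates adds or removes only
# log2(10) doublings, i.e. roughly two years at a seven-month doubling time.
print(f"Shift from a 10x error: about {math.log2(10) * doubling_months / 12:.1f} years")
```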
The practical consequences of this metric are profound: it gives companies a concrete threshold for deciding which workflows can be delegated to AI agents, offers policymakers a common yardstick for anticipating labor-market and safety impacts, and lets researchers track how quickly longer stretches of software and research work are coming within reach of automation.
The authors caution, however, against overreliance on extrapolation. They recommend expanding benchmarks to capture more realistic, context-rich tasks before using projections for prescriptive decision-making.
The METR study reframes how we measure AI progress, shifting from abstract test scores to real-world task horizons. By grounding model capabilities in human-equivalent work durations, it provides a tool for researchers, companies, and policymakers to better anticipate the impacts of rapid AI advancement.
As AI systems approach horizons measured in days or weeks, the need for thoughtful governance and workforce adaptation becomes increasingly urgent. The key takeaway: AI is not just getting “smarter”—it is learning to sustain performance over longer stretches of human-equivalent labor.
References:
Kwa T, West B, Becker J, Deng A, Garcia K, Hasin M, Jawhar S, Kinniment M, Rush N, Von Arx S, Bloom R, Broadley T, Du H, Goodrich B, Jurkovic N, Miles LH, Nix S, Lin T, Parikh N, Rein D, Sato LJK, Wijk H, Ziegler DM, Barnes E, Chan L. Measuring AI ability to complete long tasks [Internet]. arXiv; 2025 Mar [cited 2025 Sep 6]. Available from: https://ar5iv.org/abs/2503.14499
Model Evaluation & Threat Research (METR). Measuring AI ability to complete long tasks [Internet]. METR; 2025 Mar 19 [cited 2025 Sep 6]. Available from: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/