
Will there be an LLM which scores above what a human can do in 2 hours on METR's eval suite before 2026?
82% chance
METR has found that current frontier models score on its autonomy benchmark roughly on par with a human given 30 minutes. Will at least one model score at the level of a human given 2 hours before 2026?

Clarifications:
I will try to resolve this market in accordance with the current task suite. If METR makes the suite harder or easier, I will try to account for this in the resolution.
If I am not able to determine the performance of frontier models at the end of 2025, this market will be resolved NA.
Related questions
Will a publicly-available LLM achieve gold on IMO before 2026?
33% chance
Will an LLM be able to solve the Self-Referential Aptitude Test before 2027?
66% chance
Will an LLM improve its own ability along some important metric well beyond the best trained LLMs before 2026?
50% chance
LLM Hallucination: Will an LLM score >90% on SimpleQA before 2026?
60% chance
Will LLMs be able to formally verify non-trivial programs by the end of 2025?
31% chance
Will an LLM report >50% score on ARC in 2025?
85% chance
Will an LLM agent complete >50% of the lab tasks on the Factorio Learning Environment benchmark in 2025?
26% chance
Will one of the major LLMs be capable of continual lifelong learning (learning from inference runs) by EOY 2025?
26% chance
Will there be major breakthrough in LLM Continual Learning before 2026?
36% chance
Will there be any text-based task that most humans can solve, but top LLMs won't? By the end of 2024
95% chance