What is Grok 4 Heavy's performance on METR's task length evaluation?
2026
0 to 1.5 Hours: 5%
1.5 to 2 Hours: 48%
2 to 2.5 Hours: 25%
2.5 to 3 Hours: 15%
More than 3 Hours: 7%

Resolves based on METR's measurement of the length of tasks that Grok 4 Heavy can complete with a 50% success rate.
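
As a rough illustration of how a reported 50% time horizon would map onto the answer options, here is a minimal sketch. The function name `resolve_bucket` and the boundary handling are assumptions: the market does not say which option a value exactly on a boundary (for example, exactly 2 hours) belongs to, so this sketch puts it in the lower bucket to stay consistent with the "More than 3 Hours" label.

```python
from typing import List, Tuple

# Upper bounds in hours for each answer option, taken from the option labels.
# ASSUMPTION: a value exactly on a boundary resolves to the lower bucket;
# the market description does not specify this.
BUCKETS: List[Tuple[float, str]] = [
    (1.5, "0 to 1.5 Hours"),
    (2.0, "1.5 to 2 Hours"),
    (2.5, "2 to 2.5 Hours"),
    (3.0, "2.5 to 3 Hours"),
]


def resolve_bucket(horizon_hours: float) -> str:
    """Map a 50%-success time horizon (in hours) to an answer option (hypothetical helper)."""
    if horizon_hours < 0:
        raise ValueError("time horizon cannot be negative")
    for upper, label in BUCKETS:
        if horizon_hours <= upper:
            return label
    return "More than 3 Hours"


if __name__ == "__main__":
    # Illustrative input only, not a real measurement.
    print(resolve_bucket(1.75))
```

If METR reports the horizon in minutes, divide by 60 before calling the function.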

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Grok 4 Market here:

https://manifold.markets/AffineTyped/what-is-grok-4s-performance-on-metr

bought Ṁ20 1.5 to 2 Hours NO

They’re not gonna measure it, I reckon.