Will Grok 2 'exceed current [March 28 2024] AI on all metrics'?
145 traders · Ṁ50k volume · Dec 25 · 65% chance

On March 29, Elon Musk tweeted this (https://twitter.com/elonmusk/status/1773655245769330757 ):

"Should be available on 𝕏 next week. Grok 2 should exceed current AI on all metrics. In training now."

Is that so? Let's find out.

Note that for this purpose it counts as 'Grok 2' even if it is renamed. The only way a newly announced xAI model does not count is if it is named Grok 1.X or is otherwise clearly pre-2; the model in training now counts whatever they ultimately call it, if they release it.


Resolves YES if Grok 2 is released and it exceeds or ties (to 1 decimal place) Claude 3 Opus and all other models available to the public in some form on or before 3/28/24, on every metric in this chart:

So MMLU, GPQA, GSM8K, MATH, MGSM, HumanEval, DROP, Big-Bench-Hard, ARC-Challenge and HellaSwag.

Resolves NO if Grok 2 is released and does NOT exceed or tie these numbers on one or more of these metrics, or if Grok 2 is not released by EOY 2025.

If xAI does not test on all of these metrics, but it succeeds on all metrics that it does test, and there is no way to test on the others, I will use best judgment - if it clearly would have exceeded I will still resolve YES, but by default (or if it would have been close) I will assume they chose which metrics to test on based on results, and be inclined to count that as NO. Will clarify further if this gets a lot of interest, as needed.
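The tie-to-one-decimal resolution rule above can be sketched as a simple comparison. This is only an illustration of the stated rule; the function name and all benchmark scores below are hypothetical placeholders, not real results:

```python
# Sketch of the resolution rule: Grok 2 must exceed or tie (to 1 decimal
# place) the best pre-3/28/24 public model's score on every metric.

def resolves_yes(grok2_scores, baseline_scores):
    """Return True iff Grok 2 ties or exceeds the best baseline score
    on every metric, comparing values rounded to 1 decimal place."""
    for metric, grok_score in grok2_scores.items():
        best_baseline = max(model[metric] for model in baseline_scores.values())
        if round(grok_score, 1) < round(best_baseline, 1):
            return False
    return True

# Placeholder numbers for illustration only:
grok2 = {"MMLU": 87.5, "GPQA": 56.0}
baselines = {
    "Claude 3 Opus": {"MMLU": 86.8, "GPQA": 50.4},
    "GPT-4": {"MMLU": 86.4, "GPQA": 48.0},
}
print(resolves_yes(grok2, baselines))  # True: ties or beats the best on both
```

Note that a single metric falling short of the best baseline, even by 0.1 after rounding, flips the result to NO.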


Are we not making any progress on getting an actual result? Is this still actually unclear?

Seems like we should find a way to wrap this up soon, but the market reflects real uncertainty on the outcome. Can no one find a way to check?

I've moved the deadline forward to 12/24/24, as a 'this is when I just use my best judgment' date.

@ZviMowshowitz Both Drop and MGSM are considered to be saturated by OpenAI (https://github.com/openai/simple-evals) and there seem to be significant problems with DROP (https://huggingface.co/blog/open-llm-leaderboard-drop).

Taken literally, I'm at 35% on this -- sheer randomness suggests underperformance on one of these benchmarks.

In the spirit of the question, this is quite likely "yes" - even lmsys (style control) shows Grok at Llama 3.1 405B level, which seems to beat Opus on any metric I can find.

Beating both DROP and GSM8K may be difficult. The first benchmark is known not to progress steadily across the GPT-4 series; the second is so close to saturation that randomness affects results.

Note that resolving this question might require running tests on GPT-4-0125-preview.


Good thing I didn't put in more, really didn't think you'd compare a model released now to models from 5 months ago

I will be holding off resolution for a bit...

What would make you resolve this?

I think we either need a source for tests we can trust, or enough other confidence? Clearly with this at 62% we have neither of these things...

Okay, they are planning to release an API, then we can run the evaluations ourselves

Particularly relevant are the three words after the title, "Of course not."

According to xAI's own benchmarks, Grok 2 does not beat Claude 3.5 Opus.

EDIT: SORRY I MEAN SONNET (counterintuitive since the smaller anthropic model does better)


Was not available before 3/28

3.5 Sonnet was released June, so yes you're right

What if Grok 2 is bad but gets much better within the year? https://x.com/rohanpaul_ai/status/1823814591157297567
Some think the current thing is "early"

I would suggest holding off resolution until it is independently verified that xAI's claims are true. Elon Musk has a long history of lying about everything.

Cope

I am not invested in the market


It beats Opus on every benchmark that is still tested (some of those in the image, like GSM8K, are outdated) https://x.ai/blog/grok-2


How do you adjudicate differences in evaluations across models? e.g. "0-shot CoT" vs "4-shot" on MATH in the table? Does Grok 2 have to report the same evaluation type as Claude 3 for each benchmark?


@ZviMowshowitz FYI these choices frequently flip the order of model performance.


@ZviMowshowitz Will you resolve this question NO, if X-Twitter fails to release Grok 2 by some deadline?

@HankyUSA Explicitly yes EOY 2025.


I think there's a pretty good shot Grok 2 gets released within a year, half a step behind whatever OpenAI and Anthropic are up to at that point, but exceeds the current standard