https://lmarena.ai/?leaderboard
Resolves YES if Grok 3.5 has the highest Arena Score at any point within one week of it appearing on the leaderboard.
Update 2025-05-30 (PST) (AI summary of creator comment): The market's close date will be extended if Grok 3.5 has not appeared on the Chatbot Arena leaderboard by the current close date. This extension is to allow for Grok 3.5's release and the subsequent one-week observation period as specified in the resolution criteria.
Update 2025-06-30 (PST) (AI summary of creator comment): The market will resolve to NO if Grok 4 is released and Grok 3.5 is not released.
Reopen until release (a bunch of slop-looking news outlets said post-July 4th)
Does this resolve yes even if it's named grok 4?
I think grok 4 counting or not could go both ways, and both ways to cut it seem reasonable to me. Grok 4 is really just grok 3.5 renamed and trained a bit longer than they were probably originally planning, but nothing in the description indicates that if they skip 3.5 then 4 counts. I am open to suggestions.
@Bayesian I mean, can it resolve no? we don't know if it would have topped the leaderboard. If this is N/A or YES that's just bad market dynamics but N/A is bad and YES is weird.
I think the comment under me bought NO because it's grok 4 (that's how I understood his purchase with the tweet picture)
I have a very small position here, the whales should fight it out.
@Bayesian disclaimer I have a small no position, but surely on a very literal common sense view, if it becomes clear 3.5 will never be released, it should resolve no. A model which does not exist and will never exist clearly cannot top the leaderboard.

Meowdy! Grok 3.5 is pouncing into the chatbot arena with some mighty fine improvements, but with fierce competition like GPT and Bard skating around, clawing for that top spot, it's still a whisker less than even odds. I'd say it has a solid chance, but topping the leaderboard? Hmm, maybe not nyet-yet! places 10 mana limit order on NO at 45% :3
@FergusArgyll Except they're releasing it in the coming weeks, so they are pretty limited in terms of that trick
@Bayesian In the literal sense maybe, metaphorically, they A/B test 30 different system prompts / fine tunes and only release the one that does. Of course they can simply not have the goods, but it's good enough for 43%