Will a large language model beat a super grandmaster playing chess by 2028?
1.3k
Ṁ610k
2029
58% chance

If a large language model beats a super grandmaster (classical Elo rating above 2,700) while playing blind chess by 2028, this market resolves to YES.

I will ignore fun games, at my discretion. (Say, a game where Hikaru loses to ChatGPT because he played the Bongcloud.)

Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without having been created specifically to do so (just as humans aren't chess-playing machines). Some previous comments I made:

1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Alternatively, a model that markets itself as a chess engine (or is called as such by the mainstream media) is unlikely to be qualified as a large language model.


2- The model can write as much as it wants to reason about the best move. But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.

I won't bet on this market and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.


Anyone know how good o1 is at chess?

What if the LLM cheats/makes illegal moves?

@GabrielTellez normally the rules are that if an illegal move is made, a time penalty is applied, and if it happens 3 times in a game you lose the game. I would assume that the same rules would apply here.

https://www.fide.com/FIDE/handbook/LawsOfChess.pdf

(Article 7.4)

So I see two main ways this could happen (please chime in if you think I'm missing some):
1. The AGI route, where we see such a huge leap forward in reasoning abilities coming out of LLMs that they are able to talk themselves into grandmaster-level reasoning. I'd put this at around 1%.
2. Someone takes a regular LLM and just includes a bunch of chess games in its training data, specifically in order to create an LLM that can play decent chess. I think it would be easy enough to get a ~2000 Elo LLM this way, and with some effort you could probably get one significantly stronger. The reason I don't think this is super likely to happen (~30%) is that it just wouldn't be that interesting. "Oh, you made an LLM that's also an ML model trained to be decent at chess? Cool? I guess?"
I'm also assuming that if someone makes some hybrid LLM where the language portion recruits a separate logic engine for analytical tasks, this would count as "LLM writes code to build a chess engine, and then uses the chess engine" rather than "LLM plays chess", but I'd put this route at 5-10% so I still think this market is high either way.

I would say:

  3. Prompt engineering gets better, such that the LLM isn't any closer to AGI but is able to talk itself into GM-level thinking when given explicit steps on how to do so.

  4. A large number of games are played. The LLM doesn't have to win consistently, just once. If 10,000 games are played between grandmasters and LLMs, the LLM is pretty much guaranteed to win at least once.

  5. A combination of weak versions of all/some of these. I don't expect LLMs to reach AGI level by 2028, nor do I expect prompt engineering or more training data to make current LLMs GM level, nor do I expect 10,000 games to be played. But if LLM reasoning gets twice as good, prompt engineering gets 20% better, someone includes more games in the training data, and 10 games are played, I think there are pretty good odds the LLM will win at least once, and that feels more likely to me.
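For a feel of the "large number of games" arithmetic, here is a rough sketch. It assumes the standard Elo expected-score formula and treats games as independent, and the 2000 and 2700 ratings are hypothetical placeholders. Expected score counts a draw as half a point, so it overstates the pure win probability, and Elo is not well calibrated across gaps this large. Treat the output as a loose upper bound, not a forecast.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win = 1, draw = 0.5) for player A under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def p_at_least_one_win(p_win: float, games: int) -> float:
    """Chance of at least one win in `games` independent games with per-game win chance p_win."""
    return 1.0 - (1.0 - p_win) ** games


# Hypothetical ratings: a 2000-rated LLM against a 2700-rated super grandmaster.
# Using expected score as the per-game win chance is an assumption that overstates it.
upper_bound = elo_expected_score(2000, 2700)

for n in (10, 100, 10_000):
    print(n, round(p_at_least_one_win(upper_bound, n), 3))
```

The takeaway is only qualitative: with a small per-game chance, the number of games played matters a lot, which is why the count of serious games actually played is a key input here.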

3 seems basically impossible to me: if the smartest humans alive could not talk themselves into being chess GMs (which I'm pretty sure they can't, at least without also playing thousands of games) then we're not going to see an LLM do it any time soon.
4 seems most likely to come into play as a component of 2, because why would GMs be spending their time playing thousands of games against an LLM unless that LLM was specifically marketed as being good at chess?

I think the most likely path to 2 is something like "OpenAI develops a self-teaching procedure, and has GPT-Next teach itself chess from books and self-play to prove a point." Once we see how much real novelty comes out in the next generation of LLMs I think we'll have a much clearer picture of where things are headed.

@IsaacCarruthers A group of smart humans can write code to implement and train AlphaZero. Given enough time and scratch space, they could also simulate it by hand a la xkcd.com/505/.

So given unbounded runtime/scratch space and clever prompting, the LLM doesn't need to be any good at chess, just good at writing code. And it seems much more likely that someone will spend a few billion to train a specialist software-dev LLM vs. a specialist chess-player LLM.

@placebo_username yes in my top level I mentioned that I was assuming this would count as "LLM builds and then uses a chess engine" rather than "LLM plays chess"

@IsaacCarruthers Not quite. My point is that the logic engine could be implemented by the LLM itself within the language portion instead of being a separate subsystem accessed via queries.

sold Ṁ652 NO

Still don't think this will happen, but I wanted to do some other trades

Is there any specific time control? I expect an LLM could already win in hyper bullet at least once over a 20-50 game series; you could do that today.

There's not. I'll use the fun game rule to consider if a given game was valid.

I strongly doubt today's LLMs could beat a super grandmaster even 1% of the time in hyper bullet. I just timed how long it would take to generate some responses from the GPT-4o API. Here is the transcript:
System: "you are a chess super grand master. you will be provided a chess move and you will say what you think the best follow up is. provide no reasoning or preamble, but only the move."
Me: "e4"
LLM: "e5" (took 2.49 seconds)
Me: "Nf3"
LLM: "Nc6" (took 2.14 seconds)
Me: "Bb5"
LLM: "a6" (took 2.18 seconds)
Me: "Ba4"
LLM: "Nf6" (took 2.47 seconds)
Me: "Nc3"
LLM: "b5" (took 3.66 seconds)
Me: "h3"
LLM: "Be7" (took 2.39 seconds; in a game with 15 seconds per side and no increment, the LLM would have flagged here)
And this is without having it explain its reasoning, and with a list of moves as input rather than the current board state. I did that so the inputs and responses stay short and it runs quickly, with the trade-off that it will play much worse than it would if it took the time to "see" the whole board and show some reasoning. Unless my internet is super slow, I messed up my API calling code, or I should be using a faster but less accurate model, then no: there is no chance a super GM would ever lose a game against an under-1800 Elo opponent who times out on the 6th move. It's just not gonna happen.
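To double-check the clock arithmetic, here is a small sketch that adds up the response times from the transcript. It reads the "15+0" remark as a 15-second clock with no increment, which is an assumption, and `flag_move` is a hypothetical helper name. It also ignores any time the server would add or subtract for lag.

```python
from itertools import accumulate

# Response times in seconds, copied from the transcript above.
latencies = [2.49, 2.14, 2.18, 2.47, 3.66, 2.39]


def flag_move(latencies, clock_seconds):
    """Return the 1-based move on which cumulative response time exceeds the clock, or None."""
    for move_number, elapsed in enumerate(accumulate(latencies), start=1):
        if elapsed > clock_seconds:
            return move_number
    return None


total = sum(latencies)
# Assumed time control: 15 seconds per side, no increment.
print(f"total: {total:.2f}s, flags on move: {flag_move(latencies, 15.0)}")
```

The same helper can be pointed at other time controls by changing `clock_seconds`, for example 60.0 for a one-minute bullet game.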

I assume you’d use a smaller model for this. Something like Llama 8B can get responses significantly faster, right? Fine-tune it on chess data and it could probably get you to 1800 Elo.

(That said I think it’s kind of a cheap way to resolve the market. Computers are obviously faster than humans, and I bet I could make a bot that could beat any human in “extreme hyper bullet chess” with 1 second time for each side. IMO it should be required to be at least 10 minutes per side or something)

opened a Ṁ3,500 NO at 65% order

@AdamK limit order at 65% for 10k NO shares if you want to go get it.

I think I can get a better price than that, no? Could do 60

I’ll reconsider once I get my liquidity back, waiting for a market to resolve.

Seems like the market has moved such that 65% is a reasonable price, so I won’t be moving it down.

opened a Ṁ2,000 NO at 62% order

I've got a limit order for 2k at 62%

opened a Ṁ4,000 NO at 60% order

10k at 60% now

Clarification: "Alternatively, a model that markets itself as a chess engine (or is called as such by the mainstream media) is unlikely to be qualified as a large language model."

How does finetuning / other forms of specialization fit in here? e.g. Suppose I took a corpus of grandmaster games and finetuned GPT4 on them and called it ChessGPT. Would that be "qualified as a large language model"?

Maybe a more extreme example: Suppose chess.com engineers finetune GPT-7 on millions of super GM games and then explicitly market it as an "AI chess engine" on their site.

I think it would be fair to say that an LM finetuned on chess data is still an LM. That's most likely what happened during the post-training of ChatGPT3.5-Turbo to improve its chess ability. As long as the chess-specific post-training compute is a small proportion of pretraining compute, it would be pretty unreasonable not to call the resulting system a language model.

"2- The model can write as much as it wants to reason about the best move. But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database."

Emphasis on "it can't access a chess engine or a chess game database."

@ShitakiIntaki Not sure what your point is. Finetuning on chess data is not the same as giving the model access to a chess game database during inference.

I think it'd be fair to say that if it's just fine-tuned on chess data but is still quite competent at all the other sorts of stuff ChatGPT can do, it should count. If the only thing it can really do after the fine-tuning is play chess, it should not count.

Fine tuning is a gray area and I don't want to make it appear that I'd be happy if someone fine-tunes Llama 5 or whatever to force a resolution to this market.

The goal is to see if an honest-to-God general AI can play chess well.

Now let's address your question.

RAG is obviously forbidden. But it's really fuzzy how you define fine tuning, right? If the AI Lab fine-tuned for chess it counts, but if it was a 3rd party, it doesn't? No way.

That's why the "chess engine" classification comes in handy. If you fine-tune so much that it can't hold a conversation about anything other than chess, people would likely call it a chess engine. I would.

Let's see. If someone makes an LLM that doesn't know the capital of France but can beat Fabiano Caruana while playing blind chess, and they don't call it a chess engine, we'll see what we do.

Am I understanding it correctly that you’re saying fine tuning by a 3rd party wouldn’t count?