If a large language model beats a super grandmaster (classical Elo rating above 2,700) while playing blind chess by 2028, this market resolves to YES.
I will ignore fun games, at my discretion (say, a game where Hikaru loses to ChatGPT because he played the Bongcloud).
Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically for doing so (just as humans aren't chess-playing machines). Some previous comments I made:
1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Alternatively, a model that markets itself as a chess engine (or is called as such by the mainstream media) is unlikely to be qualified as a large language model.
2- The model can write as much as it wants to reason about the best move. But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.
I won't bet on this market and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.
@GabrielTellez normally the rules are that if an illegal move is made, a time penalty is applied, and if it happens 3 times in a game you lose the game. I would assume that the same rules would apply here.
https://www.fide.com/FIDE/handbook/LawsOfChess.pdf
(Article 7.4)
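A minimal sketch of the illegal-move rule as described in the comment above (a time penalty for each illegal move, loss on the third). The class name, field names, and the 120-second penalty are hypothetical choices for illustration; the thread does not state the size of the penalty.

```python
from dataclasses import dataclass


@dataclass
class IllegalMoveTracker:
    """Tracks one player's illegal moves under the rule described in the thread."""

    max_illegal_moves: int = 3      # the third illegal move loses the game
    penalty_seconds: float = 120.0  # assumed penalty size, not stated in the thread
    illegal_moves: int = 0
    penalty_total: float = 0.0      # total time penalty accumulated so far

    def record_illegal_move(self) -> bool:
        """Record one illegal move; return True if the player has now lost."""
        self.illegal_moves += 1
        if self.illegal_moves >= self.max_illegal_moves:
            return True
        self.penalty_total += self.penalty_seconds
        return False


if __name__ == "__main__":
    tracker = IllegalMoveTracker()
    for attempt in range(1, 4):
        lost = tracker.record_illegal_move()
        print(f"illegal move {attempt}: lost={lost}, penalty={tracker.penalty_total}s")
```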
So I see two main ways this could happen (please chime in if you think I'm missing some):
1. The AGI route, where we see such a huge leap forward in reasoning abilities coming out of LLMs that they are able to talk themselves into grandmaster-level reasoning. I'd put this at around 1%.
2. Someone takes a regular LLM and just includes a bunch of chess games in its training data, specifically in order to create an LLM that can play decent chess. I think it would be easy enough to get a ~2000 Elo LLM this way, and probably with some effort you could get one significantly stronger. The reason I think this isn't super likely to happen (~30%) is that it just wouldn't be that interesting. "Oh, you made an LLM that's also an ML model trained to be decent at chess? Cool? I guess?"
I'm also assuming that if someone makes some hybrid LLM where the language portion recruits a separate logic engine for analytical tasks, this would count as "LLM writes code to build a chess engine, and then uses the chess engine" rather than "LLM plays chess", but I'd put this route at 5-10% so I still think this market is high either way.
I would say:
3. Prompt engineering gets better, such that the LLM isn't closer to AGI but is able to talk itself into GM-level thinking when given explicit steps on how to do so.
4. A large number of games are played. The LLM doesn't have to win consistently, just once. If 10,000 games are played between grandmasters and LLMs, the LLM is pretty much guaranteed to win at least once.
5. A combination of weak versions of all/some of these. I don't expect LLMs to reach AGI level by 2028, nor do I expect prompt engineering or more training data to make current LLMs GM level, nor do I expect 10,000 games to be played. But if LLM reasoning gets twice as good, prompt engineering gets 20% better, someone includes more games in the training data, and 10 games are played, I think there are pretty good odds the LLM will win at least once, and that feels more likely to me.
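To put rough numbers on the "many games" argument, here is a minimal sketch. It assumes the standard Elo expected-score formula, independent games, and that the expected score can be read as a win probability (it ignores draws, so it overstates the chance of an outright win). The 1800 and 2700 ratings are illustrative assumptions, not measurements.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def p_at_least_one_win(p_win: float, n_games: int) -> float:
    """Probability of at least one win in n independent games."""
    return 1.0 - (1.0 - p_win) ** n_games


if __name__ == "__main__":
    # Assumed ratings for illustration only: an 1800-rated LLM vs a 2700-rated GM.
    p = elo_expected_score(1800, 2700)
    for n in (10, 1_000, 10_000):
        print(f"{n:>6} games: P(at least one win) ~ {p_at_least_one_win(p, n):.3f}")
```

The Elo formula is known to be unreliable at rating gaps this large, so treat the output as an order-of-magnitude guide at best.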
3 seems basically impossible to me: if the smartest humans alive could not talk themselves into being chess GMs (which I'm pretty sure they can't, at least without also playing thousands of games) then we're not going to see an LLM do it any time soon.
4 seems most likely to come into play as a component of 2, because why would GMs be spending their time playing thousands of games against an LLM unless that LLM was specifically marketed as being good at chess?
I think the most likely path to 2 is something like "OpenAI develops a self-teaching procedure, and has GPT-Next teach itself chess from books and self-play to prove a point." Once we see how much real novelty comes out in the next generation of LLMs I think we'll have a much clearer picture of where things are headed.
@IsaacCarruthers A group of smart humans can write code to implement and train AlphaZero. Given enough time and scratch space, they could also simulate it by hand a la xkcd.com/505/.
So given unbounded runtime/scratch space and clever prompting, the LLM doesn't need to be any good at chess, just good at writing code. And it seems much more likely that someone will spend a few billion to train a specialist software-dev LLM vs. a specialist chessplayer LLM.
@placebo_username yes in my top level I mentioned that I was assuming this would count as "LLM builds and then uses a chess engine" rather than "LLM plays chess"
@IsaacCarruthers Not quite. My point is that the logic engine could be implemented by the LLM itself within the language portion instead of being a separate subsystem accessed via queries.
I strongly doubt today's LLMs could beat a super grandmaster even 1% of the time in hyper bullet. I just timed how long it takes to generate some responses from the GPT-4o API. Here is the transcript:
System: "you are a chess super grand master. you will be provided a chess move and you will say what you think the best follow up is. provide no reasoning or preamble, but only the move."
Me: "e4"
LLM: "e5" (took 2.49 seconds)
Me: "Nf3"
LLM: "Nc6" (took 2.14 seconds)
Me: "Bb5"
LLM: "a6" (took 2.18 seconds)
Me: "Ba4"
LLM: "Nf6" (took 2.47 seconds)
Me: "Nc3"
LLM: "b5" (took 3.66 seconds)
Me: "h3"
LLM: "Be7" (took 2.39 seconds, if it were a 15+0 game, the LLM would have flagged here)
And this is without having it explain its reasoning or giving it the current board state instead of a list of moves. I did that so the inputs and responses would be short and it could run quickly, with the trade-off that it will play much worse than it would if it took the time to "see" the whole board and show some reasoning. Unless the problem is that my internet is super slow, that I messed up my API-calling code, or that I should be using a faster but less accurate model, then no: there is no chance a super GM could ever lose a game against an under-1800 Elo opponent who times out on the 6th move. It's just not gonna happen.
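The clock arithmetic can be checked with a short sketch. It uses the response times from the transcript above and assumes "15+0" means 15 seconds per side with no increment; the function name and parameters are hypothetical.

```python
from typing import Optional, Sequence

# Per-move response times in seconds, copied from the transcript above.
RESPONSE_TIMES = [2.49, 2.14, 2.18, 2.47, 3.66, 2.39]


def move_on_which_clock_runs_out(
    move_times: Sequence[float],
    base_seconds: float,
    increment_seconds: float = 0.0,
) -> Optional[int]:
    """Return the 1-based move number on which the clock hits zero, or None."""
    remaining = base_seconds
    for move_number, spent in enumerate(move_times, start=1):
        remaining -= spent
        if remaining <= 0:
            return move_number
        remaining += increment_seconds  # increment is added after a completed move
    return None


if __name__ == "__main__":
    print(f"total thinking time: {sum(RESPONSE_TIMES):.2f}s")
    print("flags on move:", move_on_which_clock_runs_out(RESPONSE_TIMES, base_seconds=15.0))
```

With a one-minute bullet clock instead (`base_seconds=60.0`), the same six moves would not flag, so the conclusion depends heavily on the assumed time control.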
I assume you’d use a smaller model for this. Something like LLaMa 8B can get responses significantly faster, right? Fine-tune it on chess data and it could probably get you to 1800 Elo.
(That said I think it’s kind of a cheap way to resolve the market. Computers are obviously faster than humans, and I bet I could make a bot that could beat any human in “extreme hyper bullet chess” with 1 second time for each side. IMO it should be required to be at least 10 minutes per side or something)
@AdamK limit order at 65% for 10k NO shares if you want to go get it.
Clarification: "Alternatively, a model that markets itself as a chess engine (or is called as such by the mainstream media) is unlikely to be qualified as a large language model."
How does finetuning / other forms of specialization fit in here? e.g. Suppose I took a corpus of grandmaster games and finetuned GPT4 on them and called it ChessGPT. Would that be "qualified as a large language model"?
Maybe a more extreme example: Suppose chess.com engineers finetune GPT-7 on millions of super GM games and then explicitly market it as an "AI chess engine" on their site.
I think it would be fair to say that an LM finetuned on chess data is still an LM. That's most likely what happened during the post-training of ChatGPT3.5-Turbo to improve its chess ability. As long as the chess-specific post-training compute is a small proportion of pretraining compute, it would be pretty unreasonable not to call the resulting system a language model.
@ShitakiIntaki Not sure what your point is. Finetuning on chess data is not the same as giving the model access to a chess game database during inference.
Fine tuning is a gray area and I don't want to make it appear that I'd be happy if someone fine-tunes Llama 5 or whatever to force a resolution to this market.
The goal is to see if an honest-to-God general AI can play chess well.
Now let's address your question.
RAG is obviously forbidden. But it's really fuzzy how you define fine tuning, right? If the AI Lab fine-tuned for chess it counts, but if it was a 3rd party, it doesn't? No way.
That's why the "chess engine" classification comes in handy. If you fine-tune so much that the model can't hold a conversation about anything other than chess, people would likely call it a chess engine. I would.
Let's see. If someone makes an LLM that doesn't know the capital of France but can beat Fabiano Caruana while playing blind chess, and doesn't call it a chess engine, we'll see what we do.