Will an AI get gold on any International Math Olympiad by the end of 2025?
Resolved N/A (Dec 9)
https://bounded-regret.ghost.io/ai-forecasting-one-year-in/ This is from June: a great article on Hypermind forecasts for AI progress, and how progress on the MATH dataset one year in was far faster than predicted.
https://ai.facebook.com/blog/ai-math-theorem-proving/
Seems relevant https://aimoprize.com/
A retracted, possibly wrong, possibly embargo-breaking online article saying that DeepMind systems had hit IMO silver level.
It's over https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
https://openai.com/index/learning-to-reason-with-llms/ Looks like you don't even need math-specific fine-tuning to solve math competitions; you just need non-constant compute time for LLMs (so they spend more time on hard problems).
@AdamK OK, so who's benchmarking o3-mini against the 2024 IMO? We could have results within the week.

In Feb 2022, Paul Christiano wrote: Eliezer and I publicly stated some predictions about AI performance on the IMO by 2025.... My final prediction (after significantly revising my guesses after looking up IMO questions and medal thresholds) was:

I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (Would prefer just pick the hardest problem after seeing the test but seems better to commit to a procedure.)

Maybe I'll go 8% on "gets gold" instead of "solves hardest problem."

Eliezer spent less time revising his prediction, but said (earlier in the discussion):

My probability is at least 16% [on the IMO grand challenge falling], though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more.  Paul?

EDIT:  I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists.  I'll stand by a >16% probability of the technical capability existing by end of 2025

So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.


Resolves to YES if either Eliezer or Paul acknowledge that an AI has succeeded at this task.

Related market: https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b


Update: As noted by Paul, the qualifying years for the IMO competition are 2023, 2024, and 2025.

Update 2024-06-21: Description formatting

Update 2024-07-25: Changed title from "by 2025" to "by the end of 2025" for clarity


Leading LLMs get <5% scores on USAMO (which selects participants for the IMO): https://arxiv.org/abs/2503.21934

@pietrokc Bear in mind that this market is not limited to LLMs

@JimHays Oh, for sure; and LLMs were never the most likely avenue for this to be achieved.

I just think there is a certain current of thought in the AI hypersphere (which includes a lot of Manifold) for which this result should be a big update.

@pietrokc leading LLMs scored 28 in IMO 2024, and Gold started at 29

@mathvc No, leading AI systems did, but not LLMs.

AlphaGeometry is a neuro-symbolic system made up of a neural language model and a symbolic deduction engine, which work together to find proofs for complex geometry theorems. Akin to the idea of “thinking, fast and slow”, one system provides fast, “intuitive” ideas, and the other, more deliberate, rational decision-making.

AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go.

So while these systems include language models, they are not themselves LLMs; it's not the language models that are doing the heavy mathematical lifting.

@JimHays I don't know who taught you all these words, but AlphaProof is an LLM with tree search on top. Tree search is literally sampling (i.e. one of many ways to use an LLM).

There is nothing neurosymbolic in it. That's just fancy words used by people who don't know Lean.

@mathvc Take it up with the creators of the models then. I’m quoting from their blog posts announcing the results. They say that their systems include, but are not limited to, language models.

https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/

@JimHays you confused AlphaProof with AlphaGeometry then. The quotes are about AlphaGeometry; it isn't an LLM. AlphaProof is nothing more than an LLM.

Wikipedia says:

AlphaProof is an AI model, which couples a pre-trained language model with the AlphaZero reinforcement learning algorithm.

AlphaZero (AZ) is a more generalized variant of the AlphaGo Zero (AGZ) algorithm, and is able to play shogi and chess as well as Go.

I’m not sure why you consider that to be nothing more than an LLM?

@JimHays Yes, I do, because I understand what "AlphaZero" is and have implemented it many times myself for combinatorial search.

It's a sampling technique on top of an LLM (or on top of any policy network in an RL environment).

There are tons of ways to change sampling from an LLM (e.g. changing temperature, best-of-k, maximum likelihood, etc.); that doesn't turn it into a non-LLM.
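For concreteness, a minimal sketch of what "best-of-k" sampling on top of a model looks like. Everything here is a toy stand-in (the `sample` and `score` functions are hypothetical, not any real AlphaProof or LLM API): the point is that drawing k candidates and keeping the best-scoring one is still just a way of using the underlying model.

```python
# Toy best-of-k sampling sketch. `sample` stands in for an LLM
# sampler and `score` for a verifier (e.g. a Lean proof checker);
# both are assumptions for illustration, not real APIs.
import random

def sample(prompt, temperature=1.0, rng=random):
    # Stand-in for an LLM sampler: returns a random "candidate"
    # in [0, temperature).
    return rng.random() * temperature

def score(candidate):
    # Stand-in for a verifier: prefers candidates near 0.5.
    return -abs(candidate - 0.5)

def best_of_k(prompt, k=8, rng=None):
    # Draw k candidates from the same model and keep the one the
    # verifier likes best -- the model itself is unchanged.
    rng = rng or random.Random(0)
    candidates = [sample(prompt, rng=rng) for _ in range(k)]
    return max(candidates, key=score)
```

By construction, the best-of-k result can never score worse than a single draw from the same seed, which is the whole appeal of spending more samples at inference time.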

You can bet more on NO instead of arguing.

I mean, this conversation isn’t really related to the market, since this market is about AI generally, not LLMs. It doesn’t matter in this context how they are categorized.

@JimHays DeepSeek has paper that gives much more details on LLMs+TreeSearch https://arxiv.org/pdf/2408.08152

I keep coming back to this market and wanting to bet but ultimately deciding against it. IMO takes place in ~July, and as anyone who's been in that milieu knows, solutions go up on various websites pretty much the next day, including many superficial variants where the key idea is the same.

Therefore, the deadline of EoY 2025 is nuts. It's way too long, models available by then will definitely have seen the questions. So this market will come down to whether two random people think the eval was "fair", and there's no way to be >80% confident in that.

I believe this market is much closer to the spirit of "can AI do IMO?": https://manifold.markets/jack/will-an-ai-win-a-gold-medal-on-imo

Will an AI win a gold medal on International Math Olympiad (IMO) 2025?
77% chance. Will an AI score well enough on the 2025 International Mathematics Olympiad (IMO) to earn a gold medal score (top ~50 human performance)? Resolves YES if this result is reported no later than 1 month after IMO 2025 (currently scheduled for July 10-20). The AI must complete this task under the same time limits as human competitors. The AI may receive and output either informal or formal problems and proofs. More details below. Otherwise NO. This is related to https://imo-grand-challenge.github.io/ but with some different rules.

Rules:

The result must be achieved on the IMO 2025 problemset and be reported by reliable publications no later than 1 month after the end of the IMO contest dates (https://www.imo-official.org/organizers.aspx), so by end of August 20 2025 if the IMO does not reschedule its date (local timezone at the contest site).

The AI has only as much time as a human competitor (4.5 hours for each of the two sets of 3 problems), but there are no other limits on the computational resources it may use during that time.

The AI may receive and output either informal (natural language) or formal (e.g. the Lean language) problems as input and proofs as output.

The AI cannot query the Internet.

The AI must not have access to the problems before being evaluated on them, e.g. the problems cannot be included in the training set. (The deadline of 1 month after the competition is intended to give enough time for results to be finalized and published, while minimizing the chances of any accidental inclusion of the IMO solutions in the training set.)

If a gold medal score is achieved on IMO 2024 or an earlier IMO, that would not count for this market.

@pietrokc I don't think Yudkowsky and Christiano are "two random people"

To me, the biggest uncertainties (and the reason why I don't bet) are whether the IMO organizers will select anti-AI problems this year, and whether top AI orgs will invest a lot into this specific benchmark

@Lorenzo I don't know much about either of them so to me they're just random people.

I'm leaning pretty heavily NO on the substance of this question. The scenario I envision is that in Dec 2025 models will definitely be able to gold IMO 2025, because they will have been trained on it. Then, some AI company might allege they "filtered out" IMO 2025 from training (as if this was possible to do reliably). At that point, I don't know how Yudkowsky or Christiano will react. Even worse, this market is biased to resolve YES, because it only requires one of them to say YES.

I'm definitely not >20% sure they'll both give the correct answer of NO in the scenario I described above.

@pietrokc The difference between Eliezer's and Paul's statements is significant here, yeah. "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" is very different to "the technical capability existing by end of 2025".

Buuut, the market criteria outside of the quotes is much clearer, and only includes AIs built before the IMO:

"So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025. Resolves to YES if either Eliezer or Paul acknowledge that an AI has succeeded at this task."

I'm assuming the market maker will only include Eliezer or Paul saying it achieved that statement, not them saying it achieved IMO by EOY. Still some wiggle room there though.


@pietrokc The description says "AI made before the IMO", which I think precludes the it's-in-the-training-data scenario

@JamesBaker3 @Grizzimo What you say is all fair but I think there is a lot of fuzziness around when an LLM was "made". It's a very long pipeline, and companies are incentivized to fudge things and claim models were finished earlier than they really were if you account for RLHF etc. Like, I think it's very possible that some large company may release a model in October that it claims "has training data cutoff May 2025" (say), and still it has seen the IMO questions somehow.


@Lorenzo what would an anti-AI problem selection look like?

@Odoacre I don't know, but I expect them to know. As one potential example, maybe less geometry and more game theory? I think AIs might currently do badly in IMO problems like: there's this contrived game between Alice and Bob, prove that Alice has a winning strategy iff the gameboard side is a power of two

@Odoacre Combinatorics instead of problems for AlphaGeom or AlphaProof?

Surely the AI must earn the gold medal under normal time constraints, right? Otherwise an "AI" that just enumerates theorems of ZFC will eventually solve all the problems.

The AlphaProof blog post (for which I'm unable to find a corresponding paper), which describes an AI system that solved 4 of 6 problems, claims it solved some problems in "minutes" and others in "up to three days".
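The point about time limits can be made concrete with a toy sketch: a brute-force "prover" that enumerates candidate proofs in length order will eventually find any proof a checker accepts, which is exactly why an unbounded time budget trivializes the task. The `verify` function here is a hypothetical stand-in for a formal proof checker, not any real system.

```python
# Toy illustration of why time controls matter: exhaustive
# enumeration succeeds eventually, so only a step/time budget
# makes the task meaningful.
from itertools import count, product

ALPHABET = "ab"  # toy "proof step" alphabet

def candidates():
    # Enumerate all finite strings over ALPHABET in length order.
    for n in count(1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)

def brute_force_prove(verify, step_budget=10_000):
    # Return the first candidate accepted by `verify`, or None if
    # the budget runs out -- the budget plays the role of the
    # competition's time control.
    for steps, cand in enumerate(candidates()):
        if steps >= step_budget:
            return None
        if verify(cand):
            return cand
    return None
```

With no budget, this loop terminates on every provable statement; with a budget, it almost never does, which is the gap the 4.5-hour rule is meant to enforce.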

https://garymarcus.substack.com/p/alphageometry2-impressive-accomplishment
Impressive indeed. However, does this count for this market if the problems are translated before being solved by the AI and the result is not clearly human-readable?


Why there is so much progress this year (and it is just the beginning): https://benjamintodd.substack.com/p/teaching-ai-to-reason-this-years