What will be the best score on the GPQA benchmark before 2025?

This question will resolve as the state-of-the-art accuracy on the GPQA (Diamond) benchmark by an AI system, including any post-training enhancements but excluding any human assistance. This will be based on credible publicly available results prior to January 1st 2025. Credible sources include but are not limited to blog posts, arXiv preprints, and papers.

Background information:

From GPQA (Rein et al.):

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof").

As of March 15th, 2024, the best system is based on Claude 3 Opus (Maj@32, 5-shot CoT), achieving 59.5%.

Part of the AI Benchmarks series by the AI Safety Student Team at Harvard on evaluations of AI models against technical benchmarks.


According to https://openai.com/index/learning-to-reason-with-llms/:

"On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function."

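That would be 93% if their methodology (re-ranking 1000 samples with a learned scoring function) is allowed under this question's resolution criteria. Note that the quoted figures are for AIME, not GPQA.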
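
For readers unfamiliar with the "consensus among 64 samples" figure in the quote, or the Maj@32 setting mentioned in the question description, here is a minimal sketch of majority-vote scoring. The function names and the toy data are hypothetical and only illustrate the idea; this is not OpenAI's or Anthropic's evaluation code, and it does not cover the learned re-ranking step.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most frequent answer among the sampled answers."""
    return Counter(samples).most_common(1)[0][0]


def consensus_accuracy(sampled_answers, gold_answers):
    """Fraction of problems whose majority-vote answer matches the gold answer."""
    correct = sum(
        majority_vote(samples) == gold
        for samples, gold in zip(sampled_answers, gold_answers)
    )
    return correct / len(gold_answers)


# Hypothetical toy data: 3 multiple-choice problems, 5 sampled answers each.
sampled_answers = [
    ["A", "A", "B", "A", "C"],  # majority "A"
    ["B", "C", "C", "C", "D"],  # majority "C"
    ["D", "A", "A", "B", "A"],  # majority "A"
]
gold_answers = ["A", "C", "B"]

accuracy = consensus_accuracy(sampled_answers, gold_answers)
print(f"consensus accuracy: {accuracy:.1%}")
```

With more samples per problem, the majority answer tends to be more stable than any single sample, which is why consensus scores are usually reported separately from single-sample scores.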