GPQA Diamond
GPQA Diamond (Graduate-Level Google-Proof Q&A) contains extremely difficult science questions that even domain experts find challenging. "Diamond" refers to the highest-quality subset of 198 questions, where both expert validators answered correctly and most non-expert validators did not.
View paper / source

- Models Tested: 18
- Best Score: 89.0
- Average Score: 70.2
- Scale Range: 0–100
- Weight: 1.5x
How It Works
Models answer multiple-choice graduate-level questions in physics, chemistry, and biology. Questions are designed to be "Google-proof" — they cannot be answered by simply searching the internet. Expert validators with PhDs in the relevant field verify each question's correctness and difficulty.
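For illustration, here is a minimal sketch of what a scoring harness for GPQA-style multiple-choice items might look like. The `ask_model` callback and the item fields (`question`, `options`, `answer`) are hypothetical stand-ins, not part of the official benchmark release.

```python
from typing import Callable

def score_gpqa(questions: list[dict], ask_model: Callable[[str, list[str]], str]) -> float:
    """Score a model on GPQA-style items: each item has a question,
    four answer options, and the letter of the correct option."""
    correct = 0
    for item in questions:
        # The model sees the question plus the four options and
        # returns the letter (A-D) of its chosen answer.
        choice = ask_model(item["question"], item["options"])
        if choice == item["answer"]:
            correct += 1
    # Reported scores are accuracy on a 0-100 scale.
    return 100.0 * correct / len(questions)
```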
Why It Matters
GPQA Diamond is one of the most reliable indicators of deep scientific reasoning. Because even domain experts struggle with these questions, high scores genuinely indicate advanced reasoning capabilities rather than memorisation.
Limitations
The small dataset size (~200 questions) means scores can be noisy. The heavy STEM focus doesn't capture humanities or creative reasoning, and experts disagree on some answers.
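As a rough illustration of that noise (an assumption-based estimate, not a figure from the paper): with roughly 198 questions, the binomial standard error on an accuracy estimate is a few points, so small gaps between adjacent models on the leaderboard may not be meaningful.

```python
import math

n = 198   # approximate size of the Diamond subset
p = 0.70  # example accuracy near the benchmark average

# Standard error of a binomial proportion, expressed on the 0-100 scale.
se = 100 * math.sqrt(p * (1 - p) / n)
ci_95 = 1.96 * se

print(f"standard error ≈ {se:.1f} points, 95% CI ≈ ±{ci_95:.1f} points")
# standard error ≈ 3.3 points, 95% CI ≈ ±6.4 points
```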
Leaderboard — GPQA Diamond
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 89.0 |
| 🥈 | o3 Pro | OpenAI | 87.5 |
| 🥉 | GPT-5 | OpenAI | 86.0 |
| 4 | o3 | OpenAI | 83.3 |
| 5 | Grok 4 | xAI | 82.0 |
| 6 | o4 Mini | OpenAI | 81.4 |
| 7 | Claude Opus 4 | Anthropic | 72.1 |
| 8 | R1 | DeepSeek | 71.5 |
| 9 | Gemini 2.5 Pro Preview 06-05 | Google | 68.4 |
| 10 | Grok 3 Beta | xAI | 68.2 |
| 11 | Claude Sonnet 4 | Anthropic | 67.5 |
| 12 | GPT-4.1 | OpenAI | 66.3 |
| 13 | QwQ 32B | Alibaba | 63.0 |
| 14 | Gemini 2.5 Flash | Google | 59.2 |
| 15 | DeepSeek V3 | DeepSeek | 59.1 |
| 16 | Llama 4 Maverick | Meta | 56.0 |
| 17 | GPT-4o (2024-05-13) | OpenAI | 53.6 |
| 18 | Qwen2.5 72B Instruct | Alibaba | 49.0 |
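As a quick sanity check on the summary numbers above, the best and average scores can be recomputed directly from the Score column (values copied from the table):

```python
scores = [89.0, 87.5, 86.0, 83.3, 82.0, 81.4, 72.1, 71.5, 68.4,
          68.2, 67.5, 66.3, 63.0, 59.2, 59.1, 56.0, 53.6, 49.0]

print(f"models tested: {len(scores)}")                     # 18
print(f"best score:    {max(scores):.1f}")                 # 89.0
print(f"average score: {sum(scores) / len(scores):.1f}")   # 70.2
```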