GPQA Diamond


GPQA Diamond (Graduate-Level Google-Proof Q&A) contains extremely difficult science questions that even domain experts find challenging. "Diamond" refers to the highest-quality subset: questions that expert validators answered correctly but that most skilled non-experts got wrong.


Models Tested: 18
Best Score: 89.0
Average Score: 70.2
Scale Range: 0–100
Weight: 1.5×
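The 1.5× weight presumably means GPQA Diamond counts one-and-a-half times as much as a baseline benchmark in the site's composite index. A minimal sketch of that idea, assuming a weighted arithmetic mean; the second benchmark name, its score, and its weight are hypothetical placeholders:

```python
# Sketch of how a benchmark weight like 1.5x could feed into a composite
# index, assuming a weighted arithmetic mean over 0-100 scores.
# "OtherBench" and its numbers are hypothetical.

def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-benchmark scores (all on a 0-100 scale)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores  = {"GPQA Diamond": 89.0, "OtherBench": 80.0}
weights = {"GPQA Diamond": 1.5,  "OtherBench": 1.0}

print(round(weighted_index(scores, weights), 1))  # 85.4 - GPQA pulls the mean up
```

With equal weights the mean would be 84.5; the 1.5× weight pulls it toward the GPQA Diamond score.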

How It Works

Models answer graduate-level multiple-choice questions in physics, chemistry, and biology. Questions are designed to be "Google-proof" — they cannot be answered by simply searching the internet. Expert validators with PhDs in the relevant field verify question difficulty.
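Since GPQA Diamond questions are four-option multiple choice, scoring reduces to plain accuracy over the model's chosen options. A minimal sketch; the example items below are placeholders, not real GPQA questions:

```python
# Accuracy scoring for a multiple-choice benchmark like GPQA Diamond
# (four options, A-D). Gold labels and predictions here are placeholders.

def score(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions answered correctly, as a 0-100 score."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

answers     = ["A", "C", "B", "D"]   # gold labels (placeholder)
predictions = ["A", "C", "D", "D"]   # model's chosen options

print(score(predictions, answers))  # 75.0
```

Note that random guessing on four options yields an expected score of 25.0, so the leaderboard floor is well above zero.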

Why It Matters

GPQA Diamond is one of the most reliable indicators of deep scientific reasoning. Because even domain experts struggle with these questions, high scores genuinely indicate advanced reasoning capabilities rather than memorisation.

Limitations

The small dataset size (~200 questions) means scores can be noisy. The heavy STEM focus doesn't capture humanities or creative reasoning, and experts disagree on the correct answer to some questions.
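The noise concern can be made concrete: treating each question as an independent Bernoulli trial (a simplification), the standard error of an accuracy estimate p on n questions is sqrt(p(1 − p)/n). A quick check, assuming n = 198:

```python
# Rough standard error of an accuracy score on ~198 questions,
# illustrating why small-benchmark scores are noisy. Assumes i.i.d.
# Bernoulli trials, which is a simplification.
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a binomial proportion estimate."""
    return math.sqrt(p * (1 - p) / n)

# At 70% accuracy on 198 questions, one standard error is about
# 3.3 percentage points, so two models a couple of points apart
# may be statistically tied.
print(round(100 * standard_error(0.70, 198), 1))  # 3.3
```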

Leaderboard — GPQA Diamond

 #   Model                          Provider    Score
🥇   GPT-5.2                        OpenAI      89.0
🥈   o3 Pro                         OpenAI      87.5
🥉   GPT-5                          OpenAI      86.0
 4   o3                             OpenAI      83.3
 5   Grok 4                         xAI         82.0
 6   o4 Mini                        OpenAI      81.4
 7   Claude Opus 4                  Anthropic   72.1
 8   R1                             DeepSeek    71.5
 9   Gemini 2.5 Pro Preview 06-05   Google      68.4
10   Grok 3 Beta                    xAI         68.2
11   Claude Sonnet 4                Anthropic   67.5
12   GPT-4.1                        OpenAI      66.3
13   QwQ 32B                        Alibaba     63.0
14   Gemini 2.5 Flash               Google      59.2
15   DeepSeek V3                    DeepSeek    59.1
16   Llama 4 Maverick               Meta        56.0
17   GPT-4o (2024-05-13)            OpenAI      53.6
18   Qwen2.5 72B Instruct           Alibaba     49.0