GPQA Diamond
GPQA Diamond (Graduate-Level Google-Proof Q&A) contains extremely difficult science questions that even domain experts find challenging. "Diamond" refers to the highest-quality subset of 198 questions, where both expert validators answered correctly and most non-expert validators did not.
View paper / source

- Models Tested: 18
- Best Score: 89.0
- Average Score: 70.2
- Scale Range: 0–100
- Weight: 1.5x
How It Works
Models answer multiple-choice graduate-level questions in physics, chemistry, and biology. Questions are designed to be "Google-proof" — they cannot be answered by simply searching the internet. Expert validators with PhDs in the relevant field verify each question's correctness and difficulty.
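For illustration, here is a minimal sketch of what a scoring harness for GPQA-style multiple-choice items might look like. The `ask_model` callback and the item fields (`question`, `options`, `answer`) are hypothetical stand-ins, not part of the official benchmark release.

```python
from typing import Callable

def score_gpqa(questions: list[dict], ask_model: Callable[[str, list[str]], str]) -> float:
    """Score a model on GPQA-style items: each item has a question,
    four answer options, and the letter of the correct option."""
    correct = 0
    for item in questions:
        # The model sees the question plus the four options and
        # returns the letter (A-D) of its chosen answer.
        choice = ask_model(item["question"], item["options"])
        if choice == item["answer"]:
            correct += 1
    # Reported scores are accuracy on a 0-100 scale.
    return 100.0 * correct / len(questions)
```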
Why It Matters
GPQA Diamond is one of the most reliable indicators of deep scientific reasoning. Because even domain experts struggle with these questions, high scores genuinely indicate advanced reasoning capabilities rather than memorisation.
Limitations
The small dataset size (~200 questions) means scores can be noisy. The heavy STEM focus doesn't capture humanities or creative reasoning, and experts disagree on some answers.
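As a rough illustration of that noise (an assumption-based estimate, not a figure from the paper): with roughly 198 questions, the binomial standard error on an accuracy estimate is a few points, so small gaps between adjacent models on the leaderboard may not be meaningful.

```python
import math

n = 198   # approximate size of the Diamond subset
p = 0.70  # example accuracy near the benchmark average

# Standard error of a binomial proportion, expressed on the 0-100 scale.
se = 100 * math.sqrt(p * (1 - p) / n)
ci_95 = 1.96 * se

print(f"standard error ≈ {se:.1f} points, 95% CI ≈ ±{ci_95:.1f} points")
# standard error ≈ 3.3 points, 95% CI ≈ ±6.4 points
```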
Leaderboard — GPQA Diamond
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 89.0 |
| 🥈 | o3 Pro | OpenAI | 87.5 |
| 🥉 | GPT-5 | OpenAI | 86.0 |
| 4 | o3 | OpenAI | 83.3 |
| 5 | Grok 4 | xAI | 82.0 |
| 6 | o4 Mini | OpenAI | 81.4 |
| 7 | Claude Opus 4 | Anthropic | 72.1 |
| 8 | R1 | DeepSeek | 71.5 |
| 9 | Gemini 2.5 Pro Preview 06-05 | Google | 68.4 |
| 10 | Grok 3 Beta | xAI | 68.2 |
| 11 | Claude Sonnet 4 | Anthropic | 67.5 |
| 12 | GPT-4.1 | OpenAI | 66.3 |
| 13 | QwQ 32B | Alibaba | 63.0 |
| 14 | Gemini 2.5 Flash | Google | 59.2 |
| 15 | DeepSeek V3 | DeepSeek | 59.1 |
| 16 | Llama 4 Maverick | Meta | 56.0 |
| 17 | GPT-4o (2024-05-13) | OpenAI | 53.6 |
| 18 | Qwen2.5 72B Instruct | Alibaba | 49.0 |
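As a quick sanity check on the summary numbers above, the best and average scores can be recomputed directly from the Score column (values copied from the table):

```python
scores = [89.0, 87.5, 86.0, 83.3, 82.0, 81.4, 72.1, 71.5, 68.4,
          68.2, 67.5, 66.3, 63.0, 59.2, 59.1, 56.0, 53.6, 49.0]

print(f"models tested: {len(scores)}")                     # 18
print(f"best score:    {max(scores):.1f}")                 # 89.0
print(f"average score: {sum(scores) / len(scores):.1f}")   # 70.2
```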