MATH-500
MATH-500 evaluates mathematical problem-solving on 500 competition-level problems spanning algebra, geometry, number theory, counting, and probability, drawn from AMC, AIME, and other competitions.
Models Tested: 16
Best Score: 98.0
Average Score: 88.9
Scale Range: 0–100
Weight: 1.3x
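The 1.3x weight suggests MATH-500 scores are up-weighted when benchmarks are combined into an overall index. The exact aggregation formula is not given on this page; a hypothetical weighted-average sketch, with made-up companion benchmark scores, would look like:

```python
def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-benchmark scores on a 0-100 scale.

    Hypothetical aggregation: the page only states the 1.3x weight,
    not how it is applied.
    """
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Example: MATH-500 at weight 1.3 alongside an illustrative
# second benchmark at weight 1.0 (names and scores are examples).
scores = {"MATH-500": 98.0, "OtherBench": 80.0}
weights = {"MATH-500": 1.3, "OtherBench": 1.0}
index = weighted_index(scores, weights)  # (98.0*1.3 + 80.0*1.0) / 2.3
```

Under this scheme the higher weight simply makes MATH-500 pull the composite score further toward its own value than an equally scored unweighted benchmark would.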
How It Works
Models must solve competition-level mathematics problems and produce exact final answers. Problems require multi-step reasoning, creative problem-solving, and precise calculation. Answers are graded by exact match, with no partial credit.
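Exact-match grading of this kind typically normalizes both the model's answer and the reference before comparing. A minimal sketch, assuming answers arrive as plain strings; the normalizations and function names are illustrative (real MATH graders also parse LaTeX and compare expressions symbolically):

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string before exact comparison.

    Illustrative normalizations only: strip a \\boxed{...} wrapper,
    surrounding dollar signs, spaces, and thousands separators.
    """
    s = answer.strip()
    if s.startswith("\\boxed{") and s.endswith("}"):
        s = s[len("\\boxed{"):-1]
    s = s.strip("$ ")
    # So that "1, 024" and "1024" compare equal.
    s = s.replace(" ", "").replace(",", "")
    return s


def grade(predicted: str, reference: str) -> float:
    """All-or-nothing scoring: 1.0 for an exact match, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

For example, `grade("\\boxed{1,024}", "1024")` returns 1.0, while any answer that does not normalize to the reference scores 0.0, reflecting the no-partial-credit rule.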
Why It Matters
Mathematical reasoning is a core capability for AI systems. Competition-level problems require chained logical reasoning that cannot be solved by pattern matching alone, making this a strong test of genuine reasoning ability.
Limitations
Focuses on competition-style mathematics, which may not reflect practical mathematical ability. Models may have seen similar problems in their training data. Tests only the ability to solve problems, not to formulate them.
Leaderboard — MATH-500
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | o3 Pro | OpenAI | 98.0 |
| 🥈 | R1 | DeepSeek | 97.3 |
| 🥉 | o3 | OpenAI | 96.7 |
| 4 | o4 Mini | OpenAI | 96.3 |
| 5 | Grok 4 | xAI | 95.0 |
| 6 | Qwen3 235B A22B | Alibaba | 92.0 |
| 7 | Grok 3 Beta | xAI | 91.5 |
| 8 | QwQ 32B | Alibaba | 90.6 |
| 9 | Gemini 2.5 Pro Preview 06-05 | Google | 90.2 |
| 10 | Claude Opus 4 | Anthropic | 88.7 |
| 11 | Claude Sonnet 4 | Anthropic | 85.4 |
| 12 | GPT-4.1 | OpenAI | 83.0 |
| 13 | Gemini 2.5 Flash | Google | 82.3 |
| 14 | Qwen2.5 72B Instruct | Alibaba | 80.0 |
| 15 | DeepSeek V3 | DeepSeek | 78.3 |
| 16 | GPT-4o (2024-05-13) | OpenAI | 76.6 |