MATH-500

Category: reasoning

MATH-500 evaluates mathematical problem-solving on 500 competition-level problems spanning algebra, geometry, number theory, counting, and probability from AMC, AIME, and other competitions.


Models tested: 16
Best score: 98.0
Average score: 88.9
Scale range: 0–100
Weight: 1.3×
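
The weight indicates how strongly MATH-500 counts toward the site's aggregate model score. The page does not define the formula, so the following is a minimal sketch assuming the weight acts as a multiplier in a weighted mean; weighted_average and the companion benchmark are hypothetical.

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean across benchmarks: sum(w_b * s_b) / sum(w_b).

    Hypothetical aggregation: the page only states MATH-500's weight (1.3x),
    not how the overall score is actually computed.
    """
    total = sum(weights[b] for b in scores)
    return sum(weights[b] * scores[b] for b in scores) / total

# Example with a made-up companion benchmark:
scores = {"MATH-500": 98.0, "OtherBench": 90.0}
weights = {"MATH-500": 1.3, "OtherBench": 1.0}
print(round(weighted_average(scores, weights), 1))  # 94.5
```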

How It Works

Models must solve competition-level mathematics problems and produce exact final answers. Problems require multi-step reasoning, creative problem-solving, and precise calculation. Answers are verified exactly — no partial credit.
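
The page does not publish the grader, so here is a minimal sketch of exact-match verification under one common convention: the final answer appears in a LaTeX \boxed{...} wrapper, as in the original MATH dataset. extract_boxed, normalize, and is_correct are hypothetical names, and production graders normalize far more aggressively (equivalent fractions, units, symbolic forms).

```python
import re

def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a solution, if any."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(solution):
        c = solution[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:            # matched the \boxed brace
                return "".join(out)
        out.append(c)
        i += 1
    return None                       # unbalanced braces

def normalize(ans: str) -> str:
    """Cosmetic cleanup only: drop \\left/\\right and all whitespace."""
    return re.sub(r"\s+", "", ans.replace(r"\left", "").replace(r"\right", ""))

def is_correct(model_output: str, reference: str) -> bool:
    """Exact string match after normalization; no partial credit."""
    pred = extract_boxed(model_output)
    return pred is not None and normalize(pred) == normalize(reference)

# A correct final answer scores 1; anything else, even an equivalent form
# the normalizer misses, scores 0.
assert is_correct(r"... so the answer is \boxed{\frac{1}{2}}", r"\frac{1}{2}")
assert not is_correct(r"... so the answer is \boxed{0.5}", r"\frac{1}{2}")
```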

Why It Matters

Mathematical reasoning is a core capability for AI systems. Competition-level problems require chained logical reasoning that cannot be solved by pattern matching alone, making this a strong test of genuine reasoning ability.

Limitations

Focuses on competition-style mathematics, which may not reflect practical mathematical ability. Models may have seen similar problems in their training data. Does not test the ability to formulate problems, only to solve them.

Leaderboard — MATH-500

Rank | Model | Provider | Score
1 | o3 Pro | OpenAI | 98.0
2 | R1 | DeepSeek | 97.3
3 | o3 | OpenAI | 96.7
4 | o4 Mini | OpenAI | 96.3
5 | Grok 4 | xAI | 95.0
6 | Qwen3 235B A22B | Alibaba | 92.0
7 | Grok 3 Beta | xAI | 91.5
8 | QwQ 32B | Alibaba | 90.6
9 | Gemini 2.5 Pro Preview 06-05 | Google | 90.2
10 | Claude Opus 4 | Anthropic | 88.7
11 | Claude Sonnet 4 | Anthropic | 85.4
12 | GPT-4.1 | OpenAI | 83.0
13 | Gemini 2.5 Flash | Google | 82.3
14 | Qwen2.5 72B Instruct | Alibaba | 80.0
15 | DeepSeek V3 | DeepSeek | 78.3
16 | GPT-4o (2024-05-13) | OpenAI | 76.6