MATH-500
MATH-500 evaluates mathematical problem-solving on 500 competition-level problems spanning algebra, geometry, number theory, counting, and probability, drawn from AMC, AIME, and other competitions.
Models Tested: 16
Best Score: 98.0
Average Score: 88.9
Scale Range: 0–100
Weight: 1.3x
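The 1.3x weight suggests MATH-500 scores are up-weighted when benchmarks are combined into an overall index. The exact aggregation formula is not given on this page; a hypothetical weighted-average sketch, with made-up companion benchmark scores, would look like:

```python
def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-benchmark scores on a 0-100 scale.

    Hypothetical aggregation: the page only states the 1.3x weight,
    not how it is applied.
    """
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Example: MATH-500 at weight 1.3 alongside an illustrative
# second benchmark at weight 1.0 (names and scores are examples).
scores = {"MATH-500": 98.0, "OtherBench": 80.0}
weights = {"MATH-500": 1.3, "OtherBench": 1.0}
index = weighted_index(scores, weights)  # (98.0*1.3 + 80.0*1.0) / 2.3
```

Under this scheme the higher weight simply makes MATH-500 pull the composite score further toward its own value than an equally scored unweighted benchmark would.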
How It Works
Models must solve competition-level mathematics problems and produce exact final answers. Problems require multi-step reasoning, creative problem-solving, and precise calculation. Answers are graded by exact match, with no partial credit.
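Exact-match grading of this kind typically normalizes both the model's answer and the reference before comparing. A minimal sketch, assuming answers arrive as plain strings; the normalizations and function names are illustrative (real MATH graders also parse LaTeX and compare expressions symbolically):

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string before exact comparison.

    Illustrative normalizations only: strip a \\boxed{...} wrapper,
    surrounding dollar signs, spaces, and thousands separators.
    """
    s = answer.strip()
    if s.startswith("\\boxed{") and s.endswith("}"):
        s = s[len("\\boxed{"):-1]
    s = s.strip("$ ")
    # So that "1, 024" and "1024" compare equal.
    s = s.replace(" ", "").replace(",", "")
    return s


def grade(predicted: str, reference: str) -> float:
    """All-or-nothing scoring: 1.0 for an exact match, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

For example, `grade("\\boxed{1,024}", "1024")` returns 1.0, while any answer that does not normalize to the reference scores 0.0, reflecting the no-partial-credit rule.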
Why It Matters
Mathematical reasoning is a core capability for AI systems. Competition-level problems require chained logical reasoning that cannot be solved by pattern matching alone, making this a strong test of genuine reasoning ability.
Limitations
Focuses on competition-style mathematics, which may not reflect practical mathematical ability. Models may have seen similar problems in their training data. Tests only the ability to solve problems, not to formulate them.
Leaderboard — MATH-500
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | o3 Pro | OpenAI | 98.0 |
| 🥈 | R1 | DeepSeek | 97.3 |
| 🥉 | o3 | OpenAI | 96.7 |
| 4 | o4 Mini | OpenAI | 96.3 |
| 5 | Grok 4 | xAI | 95.0 |
| 6 | Qwen3 235B A22B | Alibaba | 92.0 |
| 7 | Grok 3 Beta | xAI | 91.5 |
| 8 | QwQ 32B | Alibaba | 90.6 |
| 9 | Gemini 2.5 Pro Preview 06-05 | Google | 90.2 |
| 10 | Claude Opus 4 | Anthropic | 88.7 |
| 11 | Claude Sonnet 4 | Anthropic | 85.4 |
| 12 | GPT-4.1 | OpenAI | 83.0 |
| 13 | Gemini 2.5 Flash | Google | 82.3 |
| 14 | Qwen2.5 72B Instruct | Alibaba | 80.0 |
| 15 | DeepSeek V3 | DeepSeek | 78.3 |
| 16 | GPT-4o (2024-05-13) | OpenAI | 76.6 |