# Best AI for Math
Models ranked by mathematical reasoning performance on MATH-500, GSM8K, and AIME 2025. The composite score averages each model's available math benchmark results; see the note below the table.
| # | Model | Provider | Score | MATH-500 | AIME 2025 | GSM8K | Input $ / 1M tokens | Open Source |
|---|---|---|---|---|---|---|---|---|
| 1 | o3-pro | OpenAI | 97 | 98 | 96.7 | — | $20 | Proprietary |
| 2 | o4-mini | OpenAI | 95 | 96.3 | 92.7 | — | $1.1 | Proprietary |
| 3 | Grok 4 | xAI | 95 | 95 | — | — | $3 | Proprietary |
| 4 | o3 | OpenAI | 94 | 96.7 | 91.6 | — | $2 | Proprietary |
| 5 | Qwen3 235B A22B | Alibaba | 92 | 92 | — | — | $0.455 | ✓ Open |
| 6 | DeepSeek R1 | DeepSeek | 89 | 97.3 | 79.8 | — | $0.7 | ✓ Open |
| 7 | Claude Opus 4 | Anthropic | 89 | 88.7 | — | — | $15 | Proprietary |
| 8 | Grok 3 Beta | xAI | 88 | 91.5 | 83.9 | — | $3 | Proprietary |
| 9 | Gemini 2.5 Pro | Google | 88 | 90.2 | 86.7 | — | $1.25 | Proprietary |
| 10 | Claude Sonnet 4 | Anthropic | 85 | 85.4 | — | — | $3 | Proprietary |
| 11 | QwQ 32B | Alibaba | 85 | 90.6 | 79.5 | — | $0.15 | ✓ Open |
| 12 | GPT-4.1 | OpenAI | 83 | 83 | — | — | $2 | Proprietary |
| 13 | Gemini 2.5 Flash | Google | 82 | 82.3 | — | — | $0.3 | Proprietary |
| 14 | Qwen2.5 72B Instruct | Alibaba | 80 | 80 | — | — | $0.12 | ✓ Open |
| 15 | DeepSeek V3 | DeepSeek | 78 | 78.3 | — | — | $0.32 | ✓ Open |
| 16 | GPT-4o (extended) | OpenAI | 77 | 76.6 | — | — | $6 | Proprietary |
| 17 | GPT-5.2 | OpenAI | 45 | — | — | — | $1.75 | Proprietary |
| 18 | Claude Opus 4.6 | Anthropic | 45 | — | — | — | $15 | Proprietary |
| 19 | GPT-5 | OpenAI | 44 | — | — | — | $1.25 | Proprietary |
| 20 | Claude Sonnet 4.6 | Anthropic | 43 | — | — | — | $3 | Proprietary |
| 21 | DeepSeek V3.2 | DeepSeek | 43 | — | — | — | $0.2 | ✓ Open |
| 22 | Mistral Large | Mistral | 43 | — | — | — | $2 | ✓ Open |
| 23 | Claude 3.5 Haiku | Anthropic | 41 | — | — | — | $0.8 | Proprietary |
| 24 | Command A | Cohere | 41 | — | — | — | $2.5 | ✓ Open |
| 25 | Gemini 2.0 Flash | Google | 41 | — | — | — | $0.1 | Proprietary |
| 26 | GPT-4o-mini | OpenAI | 40 | — | — | — | $0.15 | Proprietary |
| 27 | Command R+ (08-2024) | Cohere | 40 | — | — | — | $2.5 | ✓ Open |
| 28 | Llama 3.3 70B Instruct | Meta | 40 | — | — | — | $0.1 | ✓ Open |
| 29 | GPT-5 Nano | OpenAI | 39 | — | — | — | $0.05 | Proprietary |
| 30 | Gemini 2.5 Flash Lite | Google | 39 | — | — | — | $0.1 | Proprietary |
| 31 | Nova Pro 1.0 | Amazon | 39 | — | — | — | $0.8 | Proprietary |
| 32 | Gemini 2.0 Flash Lite | Google | 38 | — | — | — | $0.075 | Proprietary |
| 33 | Mistral Small 3.1 24B | Mistral | 38 | — | — | — | $0.35 | ✓ Open |
| 34 | GPT-4.1 Nano | OpenAI | 38 | — | — | — | $0.1 | Proprietary |
| 35 | Llama 4 Maverick | Meta | 38 | — | — | — | $0.15 | ✓ Open |
| 36 | Reka Flash 3 | Reka | 37 | — | — | — | $0.1 | ✓ Open |
| 37 | Sonar | Perplexity | 37 | — | — | — | $1 | Proprietary |
| 38 | Command R (08-2024) | Cohere | 37 | — | — | — | $0.15 | ✓ Open |
| 39 | Mistral Nemo | Mistral | 36 | — | — | — | $0.02 | ✓ Open |
| 40 | Nova Lite 1.0 | Amazon | 36 | — | — | — | $0.06 | Proprietary |
The composite score is the average of each model's available math benchmark results (MATH-500, GSM8K, AIME 2025); the averaging is sketched below. Models with no benchmark data receive a quality-weighted estimate instead.
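For concreteness, here is a minimal Python sketch of the composite calculation under the stated rule: an unweighted mean of whichever benchmark scores a model reports, rounded to the nearest integer. The unweighted mean, the rounding behavior, and the `fallback_estimate` parameter are assumptions; the leaderboard does not publish its exact weighting or how the quality-weighted estimate is derived.

```python
from typing import Optional

# Benchmark columns that feed the composite (from the table header).
BENCHMARKS = ("MATH-500", "AIME 2025", "GSM8K")

def composite_score(scores: dict[str, Optional[float]],
                    fallback_estimate: Optional[float] = None) -> Optional[float]:
    """Unweighted mean of the available benchmark scores (assumed method).

    With no benchmark data at all, return `fallback_estimate`, a stand-in
    for the leaderboard's unspecified quality-weighted estimate.
    """
    available = [v for b in BENCHMARKS if (v := scores.get(b)) is not None]
    if not available:
        return fallback_estimate
    mean = sum(available) / len(available)
    return int(mean + 0.5)  # round half up; Python's round() is half-to-even

# Example: DeepSeek R1's row — MATH-500 = 97.3, AIME 2025 = 79.8, no GSM8K.
print(composite_score({"MATH-500": 97.3, "AIME 2025": 79.8}))  # -> 89
```

Applied to the rows that report two benchmarks, this simple mean reproduces the published composites (e.g. DeepSeek R1: (97.3 + 79.8) / 2 ≈ 88.6, which rounds to the listed 89), which supports the unweighted-average reading; the exact method remains an assumption where only one benchmark is reported.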