AI Model Benchmarks
Compare 30 models across 17 benchmarks: standardised tests that measure specific AI capabilities such as knowledge, reasoning, coding, maths, and human preference, with each benchmark testing a different skill. Scores are sourced from official model cards (the technical documents published alongside a model release, detailing its capabilities, limitations, and benchmark results), from technical reports, and from the LMSYS Chatbot Arena, a crowdsourced platform where real users compare two anonymous AI models side by side and vote for the better one; with over 2 million votes, it is the most trusted human preference benchmark. In the table below, a dash means no score has been reported for that model, and ★ marks the highest score in a column.
| Model | Provider | BigCodeBench | HumanEval | LiveCodeBench | SWE-bench Verified | Arena-Hard | Chatbot Arena Elo | MT-Bench | IFEval | MMLU | MMLU-Pro | MMMU | AIME 2025 | ARC Challenge | GPQA Diamond | Humanity's Last Exam | LiveBench | MATH-500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Haiku | Anthropic | — | — | — | — | — | 1260 | — | — | — | — | — | — | — | — | — | — | — |
| Claude Opus 4 | Anthropic | — | 95.0 | — | 72.5 | — | 1330 | — | — | 89.0 | — | — | — | — | 72.1 | — | — | 88.7 |
| Claude Opus 4.6 | Anthropic | 72.0 | — | — | 78.0★ | 90.0 | 1365 | 9.4 | — | — | — | — | — | — | — | 22.0 | 86.5 | — |
| Claude Sonnet 4 | Anthropic | — | 93.0 | — | 53.6 | — | 1310 | 8.8 | — | 88.0 | — | — | — | — | 67.5 | — | — | 85.4 |
| Claude Sonnet 4.6 | Anthropic | — | — | — | 72.0 | 86.0 | 1350 | — | — | — | — | — | — | — | — | — | 83.0 | — |
| Command A | Cohere | — | — | — | — | — | 1280 | — | — | — | — | — | — | — | — | — | — | — |
| DeepSeek V3 | DeepSeek | — | 89.5 | — | — | — | 1275 | — | — | 88.5 | — | — | — | — | 59.1 | — | — | 78.3 |
| DeepSeek V3 0324 | DeepSeek | — | 91.0 | — | — | — | 1310 | — | — | 89.5 | — | — | — | — | — | — | — | — |
| Gemini 2.0 Flash | Google | — | — | — | — | — | 1270 | — | — | — | — | — | — | — | — | — | — | — |
| Gemini 2.5 Flash | Google | — | 88.5 | — | — | — | 1300 | — | — | 86.5 | — | — | — | — | 59.2 | — | — | 82.3 |
| Gemini 2.5 Pro Preview 06-05 | Google | 68.0 | 93.2 | — | 63.8 | 85.5 | 1340 | 9.2 | — | 90.5 | — | — | 86.7 | — | 68.4 | — | 82.0 | 90.2 |
| Gemini 3.1 Pro Preview | Google | — | — | — | — | — | 1375★ | — | — | — | — | — | — | — | — | 25.0 | — | — |
| GPT-4.1 | OpenAI | — | 93.4 | — | 54.6 | — | 1283 | — | — | 90.2 | — | — | — | — | 66.3 | — | — | 83.0 |
| GPT-4o (extended) | OpenAI | — | 90.2 | — | — | — | 1285 | 8.6 | — | 88.7 | — | — | — | — | 53.6 | — | — | 76.6 |
| GPT-4o-mini | OpenAI | — | — | — | — | — | 1240 | — | — | — | — | — | — | — | — | — | — | — |
| GPT-5 | OpenAI | — | — | — | 75.0 | — | 1355 | — | — | — | — | — | — | — | 86.0 | — | — | — |
| GPT-5.2 | OpenAI | 73.0 | — | — | 78.0★ | 92.0★ | 1370 | 9.5★ | — | — | — | — | — | — | 89.0★ | 24.0 | 88.0★ | — |
| Grok 3 Beta | xAI | — | 93.8 | — | — | — | 1329 | — | — | 91.0 | — | — | 83.9 | — | 68.2 | — | — | 91.5 |
| Grok 4 | xAI | — | — | — | — | 89.0 | 1345 | 9.3 | — | — | — | — | — | — | 82.0 | 21.0 | 84.0 | 95.0 |
| Llama 3.3 70B Instruct | Meta | — | — | — | — | — | 1250 | — | — | — | — | — | — | — | — | — | — | — |
| Llama 4 Maverick | Meta | — | 87.5 | — | — | — | 1290 | 8.5 | — | 85.5 | — | — | — | — | 56.0 | — | — | — |
| Mistral Large | Mistral | — | — | — | — | — | 1295 | — | — | — | — | — | — | — | — | — | — | — |
| Mistral Small 3.1 24B | Mistral | — | — | — | — | — | 1235 | — | — | — | — | — | — | — | — | — | — | — |
| o3 | OpenAI | 74.0★ | 97.0★ | — | 69.1 | 88.5 | 1337 | 9.1 | — | 92.0★ | — | — | 91.6 | — | 83.3 | — | 85.0 | 96.7 |
| o3 Pro | OpenAI | — | — | — | 73.0 | — | — | — | — | — | — | — | 96.7★ | — | 87.5 | 26.6★ | — | 98.0★ |
| o4 Mini | OpenAI | — | 96.0 | — | — | — | — | — | — | — | — | — | 92.7 | — | 81.4 | — | — | 96.3 |
| Qwen2.5 72B Instruct | Alibaba | — | 86.6 | — | — | — | 1245 | — | — | 86.1 | — | — | — | — | 49.0 | — | — | 80.0 |
| Qwen3 235B A22B | Alibaba | — | — | — | — | — | 1320 | — | — | — | — | — | — | — | — | — | 78.0 | 92.0 |
| QwQ 32B | Alibaba | — | 88.0 | — | — | — | — | — | — | — | — | — | 79.5 | — | 63.0 | — | — | 90.6 |
| R1 | DeepSeek | 65.0 | 92.5 | — | — | 82.0 | 1318 | 8.9 | — | 90.8 | — | — | 79.8 | — | 71.5 | 18.0 | 80.0 | 97.3 |
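
Since the comparison above is a plain pipe-delimited Markdown table, it can be re-sorted outside this page. Below is a minimal Python sketch of one way to load it and rank models on any column; the `parse_markdown_table` and `top_models` helpers and the `benchmarks.md` filename are illustrative assumptions, not part of the site.

```python
def parse_markdown_table(text: str) -> list[dict]:
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in text.splitlines() if line.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def top_models(rows: list[dict], column: str, n: int = 5) -> list[tuple[float, str]]:
    """Rank models on one benchmark column, skipping missing scores ('—')."""
    scored = []
    for row in rows:
        raw = row.get(column, "—").replace("★", "").strip()
        if raw and raw != "—":
            scored.append((float(raw), row["Model"]))
    return sorted(scored, reverse=True)[:n]


if __name__ == "__main__":
    # "benchmarks.md" is a hypothetical local copy of the table above.
    with open("benchmarks.md", encoding="utf-8") as f:
        table = parse_markdown_table(f.read())
    print(top_models(table, "MATH-500"))
```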
Benchmark Descriptions
- BigCodeBench: Challenging code generation tasks with complex function calls and libraries
- HumanEval: Hand-written Python programming problems scored on the functional correctness of generated code
- LiveCodeBench: Competitive programming from live contests
- SWE-bench Verified: Human-validated subset of SWE-bench, resolving real GitHub issues in real repositories
- Arena-Hard: Automated benchmark using GPT-4 as judge on challenging Arena questions
- Chatbot Arena Elo: LMSYS Chatbot Arena crowdsourced Elo rating (a minimal Elo sketch follows this list)
- MT-Bench: Multi-turn conversation benchmark judged by GPT-4
- IFEval: Instruction-following evaluation built from automatically verifiable instructions
- MMLU: Massive Multitask Language Understanding — 57 academic subjects
- MMLU-Pro: Harder version of MMLU with 10 answer choices
- MMMU: Massive Multi-discipline Multimodal Understanding
- AIME 2025: American Invitational Mathematics Examination 2025
- ARC Challenge: AI2 Reasoning Challenge — grade-school science
- GPQA Diamond: Graduate-level science questions, expert-validated
- Humanity's Last Exam: Ultra-hard questions from experts across 100+ academic subjects
- LiveBench: Contamination-free benchmark using recent questions from math, coding, and reasoning
- MATH-500: A 500-problem subset of the MATH competition-mathematics dataset
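
The Chatbot Arena Elo figures summarise large numbers of pairwise human votes. As a rough intuition for how such votes map to ratings, here is the classic Elo update rule in Python. This is an illustrative sketch only: LMSYS actually fits a Bradley-Terry model over the full vote history rather than applying this online update, and the K-factor of 32 is just a conventional choice.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Probability that a model rated r_a beats a model rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one head-to-head vote and return the updated pair of ratings."""
    e_a = elo_expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta


# A 30-point gap (e.g. 1370 vs 1340) implies only a ~54% expected win rate,
# so models near the top of the Arena column are close in practice.
print(round(elo_expected(1370, 1340), 3))      # ~0.543
print(elo_update(1370, 1340, a_won=True))      # (~1384.6, ~1325.4)
```

Because the expected-score curve is shallow near equal ratings, small Arena gaps of 20 to 30 points correspond to near-coin-flip preferences rather than decisive differences.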