AI Model Benchmarks

Compare 30 models across 17 benchmarks: standardised tests that measure specific AI capabilities such as knowledge, reasoning, coding, maths, and human preference, with each benchmark testing a different skill. Scores are sourced from official model cards (technical documents published alongside a model release, detailing its capabilities, limitations, and benchmark results), technical reports, and LMSYS Chatbot Arena, a crowdsourced platform where real users compare two anonymous AI models side by side and vote for the better one; over 2 million votes make it the most trusted human preference benchmark.

30 models · 17 benchmarks
Benchmark columns, in order: BigCodeBench, HumanEval, LiveCodeBench, SWE-bench Verified, Arena-Hard, Chatbot Arena ELO, MT-Bench, IFEval, MMLU, MMLU-Pro, MMMU, AIME 2025, ARC Challenge, GPQA Diamond, Humanity's Last Exam, LiveBench, MATH-500. Each row below lists a model's reported scores in that column order; benchmarks without a reported score are omitted from the row.

Claude 3.5 Haiku · Anthropic: 1260
Claude Opus 4 · Anthropic: 95.0, 72.5, 1330, 89.0, 72.1, 88.7
Claude Opus 4.6 · Anthropic: 72.0, 78.0, 90.0, 1365, 9.4, 22.0, 86.5
Claude Sonnet 4 · Anthropic: 93.0, 53.6, 1310, 8.8, 88.0, 67.5, 85.4
Claude Sonnet 4.6 · Anthropic: 72.0, 86.0, 1350, 83.0
Command A · Cohere: 1280
DeepSeek V3 · DeepSeek: 89.5, 1275, 88.5, 59.1, 78.3
DeepSeek V3 0324 · DeepSeek: 91.0, 1310, 89.5
Gemini 2.0 Flash · Google: 1270
Gemini 2.5 Flash · Google: 88.5, 1300, 86.5, 59.2, 82.3
Gemini 2.5 Pro Preview 06-05 · Google: 68.0, 93.2, 63.8, 85.5, 1340, 9.2, 90.5, 86.7, 68.4, 82.0, 90.2
Gemini 3.1 Pro Preview · Google: 1375, 25.0
GPT-4.1 · OpenAI: 93.4, 54.6, 1283, 90.2, 66.3, 83.0
GPT-4o (extended) · OpenAI: 90.2, 1285, 8.6, 88.7, 53.6, 76.6
GPT-4o-mini · OpenAI: 1240
GPT-5 · OpenAI: 75.0, 1355, 86.0
GPT-5.2 · OpenAI: 73.0, 78.0, 92.0, 1370, 9.5, 89.0, 24.0, 88.0
Grok 3 Beta · xAI: 93.8, 1329, 91.0, 83.9, 68.2, 91.5
Grok 4 · xAI: 89.0, 1345, 9.3, 82.0, 21.0, 84.0, 95.0
Llama 3.3 70B Instruct · Meta: 1250
Llama 4 Maverick · Meta: 87.5, 1290, 8.5, 85.5, 56.0
Mistral Large · Mistral: 1295
Mistral Small 3.1 24B · Mistral: 1235
o3 · OpenAI: 74.0, 97.0, 69.1, 88.5, 1337, 9.1, 92.0, 91.6, 83.3, 85.0, 96.7
o3 Pro · OpenAI: 73.0, 96.7, 87.5, 26.6, 98.0
o4 Mini · OpenAI: 96.0, 92.7, 81.4, 96.3
Qwen2.5 72B Instruct · Alibaba: 86.6, 1245, 86.1, 49.0, 80.0
Qwen3 235B A22B · Alibaba: 1320, 78.0, 92.0
QwQ 32B · Alibaba: 88.0, 79.5, 63.0, 90.6
R1 · DeepSeek: 65.0, 92.5, 82.0, 1318, 8.9, 90.8, 79.8, 71.5, 18.0, 80.0, 97.3
Scores from official reports and LMSYS Chatbot Arena.

Benchmark Descriptions

BigCodeBench coding

Challenging code generation tasks with complex function calls and libraries

Scale: 0–100 · Weight: 1.1x
HumanEval coding

Python function completion (pass@1)

Scale: 0–100 · Weight: 1x
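
The pass@1 figure estimates the probability that a single sampled completion passes a problem's hidden unit tests. Below is a minimal sketch of the standard unbiased pass@k estimator introduced alongside HumanEval; the sample counts in the example are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, given n sampled
    completions of which c passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a passing completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 200 samples for one problem, 37 pass -> pass@1 ≈ 0.185
print(pass_at_k(n=200, c=37, k=1))
```

A benchmark score is the average of this estimate over all problems in the suite.
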
LiveCodeBench coding

Competitive programming from live contests

Scale: 0–100 · Weight: 1.2x
SWE-bench Verified coding

Real-world GitHub issue resolution

Scale: 0–100 · Weight: 1.5x
Arena-Hard conversational

Automated benchmark using GPT-4 as judge on challenging Arena questions

Scale: 0–100 · Weight: 1.2x
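
Arena-Hard does not collect human votes; a judge model compares two answers to the same question and the benchmark reports how often a candidate model beats a fixed baseline. The sketch below shows the general LLM-as-judge pattern only: query_judge is a hypothetical stand-in for the real GPT-4 API call, and the prompt is illustrative rather than the benchmark's actual template.

```python
from typing import Callable

# Illustrative prompt; not Arena-Hard's real judging template.
JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the
user question and reply with exactly one token: A, B, or TIE.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               query_judge: Callable[[str], str]) -> str:
    """Ask a judge model which answer is better.

    query_judge is a placeholder for a real model call (e.g. an API request);
    it takes the full prompt and returns the judge's raw text reply.
    """
    reply = query_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    verdict = reply.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # unparseable replies count as ties

def win_rate(verdicts: list[str]) -> float:
    """Share of comparisons won by answer A, counting ties as half a win."""
    if not verdicts:
        return 0.0
    wins = sum(1.0 if v == "A" else 0.5 if v == "TIE" else 0.0 for v in verdicts)
    return wins / len(verdicts)
```
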
Chatbot Arena ELO conversational

LMSYS Chatbot Arena crowdsourced ELO rating

Scale: 800–1400 · Weight: 1.5x
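
The Arena rating is derived from millions of pairwise human votes. Below is a minimal sketch of the classic online Elo update for a single vote; the live leaderboard fits ratings statistically over all votes at once rather than one vote at a time, but the intuition is the same. The ratings in the example are made up.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, outcome: float, k: float = 32.0):
    """Update both ratings after one comparison.

    outcome is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    delta = k * (outcome - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Made-up vote: a 1300-rated model beats a 1280-rated one
print(elo_update(1300, 1280, outcome=1.0))  # roughly (1315.1, 1264.9)
```
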
MT-Bench conversational

Multi-turn conversation benchmark judged by GPT-4

Scale: 0–10 · Weight: 0.8x
IFEval instruction

Instruction Following Evaluation

Scale: 0–100 · Weight: 0.8x
MMLU knowledge

Massive Multitask Language Understanding — 57 academic subjects

Scale: 0–100 · Weight: 0.8x
MMLU-Pro knowledge

Harder version of MMLU with 10 answer choices

Scale: 0–100 · Weight: 1.2x
MMMU multimodal

Massive Multi-discipline Multimodal Understanding

Scale: 0–100 · Weight: 1x
AIME 2025 reasoning

American Invitational Mathematics Examination 2025

Scale: 0–100 · Weight: 1.4x
ARC Challenge reasoning

AI2 Reasoning Challenge — grade school science

Scale: 0–100 · Weight: 0.6x
GPQA Diamond reasoning

Graduate-level science questions, expert-validated

Scale: 0–100 · Weight: 1.5x
Humanity's Last Exam reasoning

Ultra-hard questions from experts across 100+ academic subjects

Scale: 0–100 · Weight: 1.5x
LiveBench reasoning

Contamination-free benchmark using recent questions from math, coding, and reasoning

Scale: 0–100 · Weight: 1.3x
MATH-500 reasoning

Competition-level mathematics problems

Scale: 0–100 · Weight: 1.3x
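
The scales and weights above suggest how the table could be folded into one composite number per model: normalise each score to its scale, weight it, and average over the benchmarks that model has actually reported. The page does not state its exact formula, so the sketch below is an assumed aggregation; the scales and weights come from the descriptions above (only a subset is shown), while the example scores are made up.

```python
# Assumed aggregation: scale-normalised, weight-averaged over reported scores only.
# Scales and weights are taken from the benchmark descriptions above (subset shown).
BENCHMARKS = {
    # name: (scale_min, scale_max, weight)
    "HumanEval":          (0, 100, 1.0),
    "SWE-bench Verified": (0, 100, 1.5),
    "Chatbot Arena ELO":  (800, 1400, 1.5),
    "MT-Bench":           (0, 10, 0.8),
    "GPQA Diamond":       (0, 100, 1.5),
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of scale-normalised scores, ignoring unmeasured benchmarks."""
    total, weight_sum = 0.0, 0.0
    for name, value in scores.items():
        lo, hi, weight = BENCHMARKS[name]
        total += weight * (value - lo) / (hi - lo)  # map each score onto 0..1
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

# Made-up example covering three of the benchmarks
print(composite_score({"HumanEval": 95.0, "Chatbot Arena ELO": 1330, "MT-Bench": 8.8}))
```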