AI Model Benchmarks

Compare 58 models across 34 benchmarks. Scores are sourced from official model cards, technical reports, and LMSYS Chatbot Arena.

Benchmarks: standardised tests that measure specific AI capabilities (knowledge, reasoning, coding, maths, and human preference). Each benchmark tests a different skill.

Model cards: technical documents published alongside a model release, detailing its capabilities, limitations, and benchmark results.

LMSYS Chatbot Arena: a crowdsourced platform where real users compare two anonymous AI models side-by-side and vote for the better one. Over 2 million votes make it the most trusted human preference benchmark.

External benchmark explorers and degradation trackers

Use these resources when you need repeated-evaluation views, agent harness tracking, or historical performance data beyond this site’s current benchmark matrix.

Factuality and hallucination

SimpleQA

OpenAI publication + open benchmark

Short-form factuality benchmark for fact-seeking questions with single verifiable answers. Useful for measuring whether a model answers accurately instead of confidently guessing.

Strong fit for the Reliability Floor factuality dimension, but too narrow to stand alone as a general reliability score.

Links: OpenAI overview · Paper · GitHub

Grounded long-form factuality

FACTS Grounding

Google DeepMind benchmark + Kaggle leaderboard

Benchmark for whether long-form answers stay faithful to supplied documents while still addressing the user request.

Good fit for the Reliability Floor groundedness dimension because it tests source-grounded answers rather than broad world knowledge alone.

Links: DeepMind overview · Leaderboard · Paper

Instruction following

IFEval

Paper + Google Research code/data

Verifiable instruction-following benchmark covering constraints such as word counts, required keywords, and formatting rules.

Useful because it avoids purely subjective judging for a core reliability question: did the model do exactly what it was told?

Links: Paper · Google Research code
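To make "verifiable" concrete, the sketch below checks the three constraint types named above (word count, required keywords, a formatting rule) with plain string operations. The function names, the bullet rule, and the sample prompt are assumptions chosen for illustration; none of this is taken from the IFEval code.

```python
import re


def check_max_words(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_keywords(response: str, keywords: list[str]) -> bool:
    """True if every required keyword appears (case-insensitive)."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)


def check_bullet_count(response: str, expected: int) -> bool:
    """True if exactly `expected` lines start with '- ' or '* '."""
    bullets = [ln for ln in response.splitlines() if re.match(r"\s*[-*] ", ln)]
    return len(bullets) == expected


def follows_all(response: str) -> bool:
    """Hypothetical prompt: 'List two risks as bullets, mention latency, max 20 words.'"""
    return (
        check_max_words(response, 20)
        and check_keywords(response, ["latency"])
        and check_bullet_count(response, 2)
    )
```

Each check returns a plain boolean, so a response either satisfies the instruction or it does not, with no judge model involved.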

Long-context retrieval

RULER

Paper + open-source code

Long-context benchmark that extends needle-in-haystack into harder retrieval, multi-hop tracing, aggregation, and question-answering tasks.

Use as a retrieval stress test, not as proof of real-world long-context reliability by itself.

Links: Paper · GitHub
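For orientation, this is a minimal sketch of the basic needle-in-haystack setup that RULER is described as extending. It is not RULER's own task generator; the filler sentence, the needle, and the function names are all hypothetical.

```python
def build_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Place `needle` among `n_sentences` copies of `filler`.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieved(model_answer: str, expected_value: str) -> bool:
    """Score a single trial: did the answer reproduce the hidden value?"""
    return expected_value in model_answer


# Hypothetical needle and question, for illustration only.
needle = "The magic number for project Falcon is 7421."
context = build_haystack(needle, "The sky was grey that day.", n_sentences=200, depth=0.5)
prompt = context + "\n\nWhat is the magic number for project Falcon?"
```

Sweeping `n_sentences` and `depth` gives the familiar grid of retrieval accuracy by context length and needle position, which is the part that says little about downstream tasks on its own.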

Real long-context tasks

HELMET

Hugging Face benchmark + paper

Holistic long-context evaluation suite with more application-like tasks including retrieval-augmented generation, citations, summarization, and reranking.

Useful counterweight to simple needle tests because the authors explicitly warn that simple synthetic retrieval does not predict real downstream performance well.

Links: Hugging Face overview · Paper · GitHub

RAG factual QA

CRAG

NeurIPS paper + GitHub

Comprehensive RAG benchmark for dynamic factual question answering with web and knowledge-graph search simulations.

Best treated as a RAG-system reliability source, because scores depend on retrieval setup as well as the base model.

Links: NeurIPS abstract · Paper · GitHub

Long-term memory

LongMemEval

OpenReview paper

Benchmark for long-term assistant memory across sustained interactions, including extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

Useful for the Reliability Floor memory dimension, but memory claims need especially careful evidence labels and limitations.

Links: OpenReview

Fresh objective tasks

LiveBench

Leaderboard + paper + GitHub

Frequently updated objective benchmark covering math, coding, reasoning, language, instruction following, and data analysis.

Useful for fresh-data discipline and objective grading; only the relevant categories should contribute to the Reliability Floor.

Links: Leaderboard · Paper · GitHub

Holistic model evaluation

HELM

Stanford CRFM benchmark + framework

Stanford CRFM evaluation framework and benchmark collection designed to make model comparisons more transparent across scenarios, metrics, prompts, and adapters.

Useful as a provenance model for the Reliability Floor because it treats prompt, harness, scenario, and metric choices as part of the result rather than invisible background detail.

Links: HELM homepage · HELM GitHub · Paper

Agent evals and degradation tracking

MarginLab

Public site + docs + GitHub

Open benchmark and eval ecosystem focused on robust, reproducible agent testing, with public degradation trackers and historical views for coding agents.

Useful when you want repeated agent and harness measurement rather than a single static benchmark snapshot.

Links: Homepage · Documentation · GitHub

58 models · 34 benchmarks

Model	Provider	GAIA	TAU-bench	WebArena	Aider Polyglot	BFCL	BigCodeBench	HumanEval	LiveCodeBench	SWE-bench Verified	Arena-Hard	Chatbot Arena ELO	MT-Bench	Creative Writing Bench	FinQA	FinanceBench	LegalBench	MedQA	WildBench Creative	IFEval	MMLU	MMLU-Pro	MGSM	MMMU	MMMU-Pro	MathVista	AIME 2025	ARC Challenge	GPQA Diamond	Humanity's Last Exam	LiveBench	MATH-500	AIR-Bench	SimpleQA	TrustLLM
Claude 3 Opus	Anthropic	—	—	—	—	—	—	84.9	—	—	—	1253	—	—	—	—	—	—	—	—	86.8	—	—	—	—	—	—	—	50.4	—	—	60.1	—	—	—
Claude 3.5 Haiku	Anthropic	—	—	—	—	—	—	88.1	—	—	—	1260	—	—	—	—	—	—	—	—	83.0	—	—	—	—	—	—	—	—	—	—	69.3	—	—	—
Claude 3.5 Sonnet	Anthropic	—	—	—	—	—	—	93.7	—	49.0	—	1285	8.7	—	—	—	—	—	—	—	88.7	—	—	—	—	—	—	—	65.0	—	—	78.3	—	—	—
Claude 3.7 Sonnet	Anthropic	—	—	—	—	—	—	—	—	62.3	—	1310	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	68.0	—	—	86.0	—	—	—
Claude Haiku 4.5	Anthropic	—	—	—	—	—	—	90.0	—	52.0	—	1295	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Claude Opus 4	Anthropic	—	62.0	38.0	—	—	—	95.0	—	72.5	—	1330	—	88.0	78.0	74.0	82.0	89.0	84.0	—	89.0	—	90.0	—	64.0	—	—	—	72.1	—	—	88.7	—	44.0	—
Claude Opus 4.5	Anthropic	—	—	—	—	—	—	95.5	—	75.0	—	1355	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	78.0	—	—	—	—	—	—
Claude Opus 4.6	Anthropic	75.0	70.0	48.0	82.0★	—	72.0	—	—	78.0	90.0	1365	9.4	92.0★	83.0	80.0	86.0	—	88.0★	—	—	—	—	—	—	—	—	—	—	22.0	86.5	—	—	—	—
Claude Sonnet 4	Anthropic	—	—	—	—	88.0	—	93.0	—	53.6	—	1310	8.8	86.0	74.0	—	79.0	—	82.0	—	88.0	—	—	—	—	—	—	—	67.5	—	—	85.4	—	41.0	—
Claude Sonnet 4.5	Anthropic	—	—	—	—	—	—	—	—	68.0	—	1340	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	72.0	—	—	—	—	—	—
Claude Sonnet 4.6	Anthropic	—	—	—	79.0	—	—	—	—	72.0	86.0	1350	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	83.0	—	—	—	—
Command A	Cohere	—	—	—	—	—	—	85.0	—	—	—	1280	—	—	—	—	—	—	—	—	83.5	—	—	—	—	—	—	—	—	—	—	—	—	—	—
DeepSeek V3	DeepSeek	—	—	—	—	—	—	89.5	—	—	—	1275	—	—	—	—	—	—	—	—	88.5	—	—	—	—	—	—	—	59.1	—	—	78.3	—	—	—
DeepSeek V3 0324	DeepSeek	—	—	—	—	—	—	91.0	—	50.0	—	1310	—	—	—	—	—	—	—	—	89.5	—	—	—	—	—	—	—	62.0	—	—	84.0	—	—	—
Gemini 1.5 Pro	Google	—	—	—	—	—	—	84.1	—	—	—	1260	—	—	—	—	—	—	—	—	85.9	—	—	—	—	—	—	—	—	—	—	67.7	—	—	—
Gemini 2.0 Flash	Google	—	—	—	—	—	—	—	—	—	—	1270	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Gemini 2.5 Flash	Google	—	—	—	—	—	—	88.5	—	—	—	1300	—	—	—	—	—	—	—	—	86.5	—	—	—	—	—	—	—	59.2	—	—	82.3	—	—	—
Gemini 2.5 Flash Lite	Google	—	—	—	—	—	—	82.0	—	—	—	1240	—	—	—	—	—	—	—	—	80.0	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Gemini 2.5 Pro Preview 06-05	Google	68.0	64.0	42.0	72.0	87.0	68.0	93.2	—	63.8	85.5	1340	9.2	85.0	80.0	77.0	84.0	91.0	82.0	—	90.5	—	92.0	—	68.0	—	86.7	—	68.4	—	82.0	90.2	—	47.0	—
Gemini 3 Flash Preview	Google	—	—	—	—	—	—	89.0	—	55.0	—	1310	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Gemini 3 Pro	Google	—	—	—	—	—	—	—	—	70.0	—	1362	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	80.0	—	—	—	—	—	—
Gemini 3.1 Pro Preview	Google	—	—	—	—	—	—	—	—	—	—	1375	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	25.0	—	—	—	—	—
GPT-4.1	OpenAI	—	—	—	—	—	—	93.4	—	54.6	—	1283	—	—	—	—	—	—	—	—	90.2	—	—	—	—	—	—	—	66.3	—	—	83.0	—	—	—
GPT-4.1 Mini	OpenAI	—	—	—	—	—	—	90.5	—	42.0	—	1275	—	—	—	—	—	—	—	—	87.5	—	—	—	—	—	—	—	—	—	—	—	—	—	—
GPT-4.5	OpenAI	—	—	—	—	—	—	91.0	—	—	—	1300	8.8	—	—	—	—	—	—	—	90.2	—	—	—	—	—	—	—	62.0	—	—	82.0	—	62.5★	—
GPT-4o (2024-05-13)	OpenAI	—	52.0	30.0	—	85.0	—	90.2	—	—	—	1285	8.6	82.0	72.0	68.0	78.0	86.1	78.0	—	88.7	—	86.5	—	51.0	—	—	—	53.6	—	—	76.6	—	38.2	—
GPT-4o-mini	OpenAI	—	—	—	—	—	—	87.2	—	—	—	1240	—	—	—	—	—	—	—	—	82.0	—	—	—	—	—	—	—	—	—	—	70.2	—	—	—
GPT-5	OpenAI	—	—	—	—	—	—	—	—	75.0	—	1355	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	86.0	—	—	—	—	52.0	—
GPT-5 Pro	OpenAI	—	—	—	—	—	—	—	—	76.5	—	1360	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	88.0	—	—	97.5	—	—	—
GPT-5.2	OpenAI	78.0★	72.0★	52.0★	80.0	92.0★	73.0	—	—	78.0	92.0★	1370	9.5★	90.0	85.0★	82.0★	88.0★	94.0★	86.0	—	—	—	95.0★	—	72.0★	—	—	—	89.0	24.0	88.0★	—	—	58.0	—
GPT-5.2 Pro	OpenAI	—	—	—	—	—	—	—	—	80.0★	—	1380★	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	91.0★	—	—	98.5★	—	—	—
Grok 2	xAI	—	—	—	—	—	—	—	—	—	—	1270	—	—	—	—	—	—	—	—	87.5	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Grok 3	xAI	—	—	—	—	82.0	—	93.8	—	—	—	1329	—	—	—	—	—	—	—	—	91.0	—	—	—	—	—	83.9	—	68.2	—	—	91.5	—	—	—
Grok 3 Mini	xAI	—	—	—	—	—	—	88.0	—	—	—	1305	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	85.0	—	—	—
Grok 4	xAI	70.0	66.0	40.0	—	—	—	—	—	—	89.0	1345	9.3	84.0	79.0	76.0	83.0	—	80.0	—	—	—	—	—	—	—	—	—	82.0	21.0	84.0	95.0	—	—	—
Grok 4 Fast	xAI	—	—	—	—	—	—	90.0	—	—	—	1310	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Llama 3.1 405B	Meta	—	—	—	—	—	—	89.0	—	—	—	1253	—	—	—	—	—	—	—	—	88.6	—	—	—	—	—	—	—	51.1	—	—	—	—	—	—
Llama 3.3 70B Instruct	Meta	—	—	—	—	—	—	88.4	—	—	—	1250	—	—	—	—	—	—	—	—	86.0	—	—	—	—	—	—	—	50.7	—	—	—	—	—	—
Llama 4 Maverick	Meta	—	—	—	—	—	—	87.5	—	—	—	1290	8.5	78.0	68.0	—	72.0	—	—	—	85.5	—	84.0	—	—	—	—	—	56.0	—	—	81.5	—	—	—
Llama 4 Scout	Meta	—	—	—	—	—	—	85.0	—	—	—	1275	—	—	—	—	—	—	—	—	83.0	—	—	—	—	—	—	—	—	—	—	76.8	—	—	—
Mistral Large	Mistral	—	—	—	—	—	—	89.5	—	45.0	—	1295	—	80.0	—	—	—	—	—	—	87.0	—	—	—	—	—	—	—	58.0	—	—	—	—	—	—
Mistral Small 3.1 24B	Mistral	—	—	—	—	—	—	80.0	—	—	—	1235	—	—	—	—	—	—	—	—	78.0	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Nemotron 70B	NVIDIA	—	—	—	—	—	—	—	—	—	—	1265	—	—	—	—	—	—	—	—	85.0	—	—	—	—	—	—	—	—	—	—	—	—	—	—
o1	OpenAI	—	—	—	—	—	—	94.0	—	48.9	—	1335	—	—	—	—	—	—	—	—	91.8	—	—	—	—	—	—	—	78.0	—	—	96.4	—	—	—
o1-mini	OpenAI	—	—	—	—	—	—	92.4	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	60.0	—	—	90.0	—	—	—
o3	OpenAI	72.0	68.0	45.0	76.0	—	74.0★	97.0★	—	69.1	88.5	1337	9.1	—	82.0	79.0	85.0	92.0	—	—	92.0★	—	93.0	—	—	—	91.6	—	83.3	—	85.0	96.7	—	49.0	—
o3 Mini	OpenAI	—	—	—	—	—	—	94.5	—	—	—	1310	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	75.0	—	—	94.0	—	—	—
o3 Pro	OpenAI	—	—	—	—	—	—	—	—	73.0	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	96.7★	—	87.5	26.6★	—	98.0	—	—	—
o4 Mini	OpenAI	—	—	—	—	—	—	96.0	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	92.7	—	81.4	—	—	96.3	—	—	—
Phi 4	Microsoft	—	—	—	—	—	—	82.6	—	—	—	—	—	—	—	—	—	—	—	—	84.8	—	—	—	—	—	—	—	56.1	—	—	80.4	—	—	—
Phi-4 Reasoning	Microsoft	—	—	—	—	—	—	89.0	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	75.3	—	65.8	—	—	94.3	—	—	—
Qwen2.5 72B Instruct	Alibaba	—	—	—	—	—	—	86.6	—	—	—	1245	—	—	—	—	—	—	—	—	86.1	—	—	—	—	—	—	—	49.0	—	—	80.0	—	—	—
Qwen2.5 Coder 32B Instruct	Alibaba	—	—	—	73.7	—	—	92.7	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
Qwen3 235B A22B	Alibaba	—	—	—	—	—	—	—	—	—	—	1320	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	78.0	92.0	—	—	—
Qwen3 Max	Alibaba	—	—	—	—	—	—	93.0	—	—	—	1335	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	72.0	—	—	93.5	—	—	—
QwQ 32B	Alibaba	—	—	—	—	—	—	88.0	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	79.5	—	63.0	—	—	90.6	—	—	—
R1	DeepSeek	—	55.0	28.0	65.0	—	65.0	92.5	—	—	82.0	1318	8.9	72.0	76.0	72.0	76.0	—	68.0	—	90.8	—	—	—	—	—	79.8	—	71.5	18.0	80.0	97.3	—	—	—
R1 0528	DeepSeek	—	—	—	—	—	—	94.0	—	57.6	—	1328	—	—	—	—	—	—	—	—	—	—	—	—	—	—	87.5	—	76.0	—	—	97.8	—	—	—

★ = best score in column
— = not yet measured
Scores from official reports and LMSYS Chatbot Arena
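When working with the table programmatically, the two legend symbols need handling before the numbers can be used. The sketch below assumes the rows are available as tab-separated text in the layout shown above; `parse_cell` and `parse_row` are hypothetical helper names, and the sample row is a three-column excerpt rather than a full line.

```python
from typing import Optional

MISSING = "—"   # marks "not yet measured" in the table
BEST = "★"      # marks the best score in a column


def parse_cell(cell: str) -> tuple[Optional[float], bool]:
    """Return (score, is_best) for one table cell; score is None when missing."""
    cell = cell.strip()
    if cell == MISSING or cell == "":
        return None, False
    is_best = cell.endswith(BEST)
    return float(cell.rstrip(BEST)), is_best


def parse_row(line: str, benchmarks: list[str]) -> dict:
    """Split one tab-separated row into model, provider, and measured scores."""
    model, provider, *cells = line.split("\t")
    scores = {}
    for name, cell in zip(benchmarks, cells):
        value, _ = parse_cell(cell)
        if value is not None:
            scores[name] = value
    return {"model": model, "provider": provider, "scores": scores}


# Three-column excerpt of the o3 row, for illustration only.
benchmarks = ["HumanEval", "SWE-bench Verified", "Chatbot Arena ELO"]
row = parse_row("o3\tOpenAI\t97.0★\t69.1\t1337", benchmarks)
```

Keeping missing cells out of the `scores` dictionary, rather than storing zeros, avoids penalising a model for benchmarks it was never run on.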

Benchmark Descriptions

GAIA agent

General AI Assistant tasks requiring web browsing, reasoning, and tool use

Scale: 0–100 · Weight: 1.3x

TAU-bench agent

Tool-Agent-User interaction quality across multi-step scenarios

Scale: 0–100 · Weight: 1x

WebArena agent

Autonomous web navigation and task completion in realistic environments

Scale: 0–100 · Weight: 1.1x

Aider Polyglot coding

Multi-language coding: 225 exercises across C++, Go, Java, JS, Python, Rust

Scale: 0–100 · Weight: 1.1x

BFCL coding

Berkeley Function Calling Leaderboard — tool/function invocation accuracy

Scale: 0–100 · Weight: 1x

BigCodeBench coding

Challenging code generation tasks with complex function calls and libraries

Scale: 0–100 · Weight: 1.1x

HumanEval coding

Python function completion (pass@1)

Scale: 0–100 · Weight: 1x
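pass@1 is the k = 1 case of the pass@k metric. The sketch below uses the widely cited unbiased estimator, assuming n samples are generated per problem and c of them pass the unit tests; the function name is illustrative, and how each reported HumanEval score was actually sampled is not stated on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated for the problem
    c: samples that passed the tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws of k samples that contain no passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c / n, the fraction of samples that pass; a benchmark score is the mean of this value over all problems.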

LiveCodeBench coding

Competitive programming from live contests

Scale: 0–100 · Weight: 1.2x

SWE-bench Verified coding

Real-world GitHub issue resolution

Scale: 0–100 · Weight: 1.5x

Arena-Hard conversational

Automated benchmark using GPT-4 as judge on challenging Arena questions

Scale: 0–100 · Weight: 1.2x

Chatbot Arena ELO conversational

LMSYS Chatbot Arena crowdsourced ELO rating

Scale: 800–1400 · Weight: 1.5x
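A rating gap is easier to interpret as an expected head-to-head preference rate. The sketch below assumes the ratings follow the conventional Elo logistic curve with a 400-point scale; the page does not say how the ratings are fitted, so treat the result as a rough reading rather than a published figure.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Expected share of votes for model A over model B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the table: GPT-5.2 Pro (1380) against Claude 3 Opus (1253).
p = expected_preference(1380, 1253)  # roughly 0.68
```

Under this assumption, the 127-point gap between the top and an older model still leaves the lower-rated model preferred in about a third of comparisons.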

MT-Bench conversational

Multi-turn conversation benchmark judged by GPT-4

Scale: 0–10 · Weight: 0.8x

Creative Writing Bench domain

Expert-judged creative writing quality across fiction, poetry, and narrative tasks

Scale: 0–100 · Weight: 0.8x

FinQA domain

Financial question answering over earnings reports — numerical reasoning on real SEC filings

Scale: 0–100 · Weight: 0.8x

FinanceBench domain

Open-ended financial analysis — 150 questions over 10-K and 10-Q filings

Scale: 0–100 · Weight: 0.8x

LegalBench domain

Legal reasoning across 162 tasks: issue-spotting, rule-recall, interpretation

Scale: 0–100 · Weight: 0.8x

MedQA domain

US Medical Licensing Exam questions — medical knowledge and reasoning

Scale: 0–100 · Weight: 0.8x

WildBench Creative domain

Creative subset of WildBench — real user creative writing prompts judged by GPT-4

Scale: 0–100 · Weight: 0.8x

IFEval instruction

Instruction Following Evaluation: verifiable constraints such as word counts, required keywords, and formatting rules

Scale: 0–100 · Weight: 0.8x

MMLU knowledge

Massive Multitask Language Understanding — 57 academic subjects

Scale: 0–100 · Weight: 0.8x

MMLU-Pro knowledge

Harder version of MMLU with 10 answer choices

Scale: 0–100 · Weight: 1.2x

MGSM multilingual

Multilingual Grade School Math — 250 problems in 10 languages

Scale: 0–100 · Weight: 0.7x

MMMU multimodal

Massive Multi-discipline Multimodal Understanding

Scale: 0–100 · Weight: 1x

MMMU-Pro multimodal

Enhanced multimodal understanding — harder than MMMU with no shortcut strategies

Scale: 0–100 · Weight: 1.2x

MathVista multimodal

Mathematical reasoning in visual contexts — diagrams, charts, figures

Scale: 0–100 · Weight: 1x

AIME 2025 reasoning

American Invitational Mathematics Examination 2025

Scale: 0–100 · Weight: 1.4x

ARC Challenge reasoning

AI2 Reasoning Challenge — grade school science

Scale: 0–100 · Weight: 0.6x

GPQA Diamond reasoning

Graduate-level science questions, expert-validated

Scale: 0–100 · Weight: 1.5x

Humanity's Last Exam reasoning

Ultra-hard questions from experts across 100+ academic subjects

Scale: 0–100 · Weight: 1.5x

LiveBench reasoning

Contamination-free benchmark using recent questions from math, coding, and reasoning

Scale: 0–100 · Weight: 1.3x

MATH-500 reasoning

Competition-level mathematics problems

Scale: 0–100 · Weight: 1.3x

AIR-Bench safety

AI safety aligned with regulations — 5,694 tests across 314 risk categories

Scale: 0–100 · Weight: 0.8x

SimpleQA safety

Factual accuracy on straightforward questions — measures hallucination rate

Scale: 0–100 · Weight: 1x

TrustLLM safety

Comprehensive trustworthiness: truthfulness, safety, fairness, robustness, privacy

Scale: 0–100 · Weight: 0.8x
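The descriptions list a scale and a weight for each benchmark but do not state how they are combined. One plausible reading, sketched below purely as an assumption, is to rescale each score to the 0 to 1 range using its listed scale and then take a weighted mean over the benchmarks a model has actually been measured on. The function names are hypothetical, and only three benchmarks are included to keep the example short.

```python
from typing import Optional

# (scale_min, scale_max, weight) copied from the descriptions for three benchmarks.
BENCHMARKS = {
    "SWE-bench Verified": (0.0, 100.0, 1.5),
    "Chatbot Arena ELO": (800.0, 1400.0, 1.5),
    "MT-Bench": (0.0, 10.0, 0.8),
}


def normalise(score: float, lo: float, hi: float) -> float:
    """Rescale a raw score to the 0..1 range using the benchmark's listed scale."""
    return (score - lo) / (hi - lo)


def weighted_score(scores: dict) -> Optional[float]:
    """Weighted mean of normalised scores, on a 0..100 scale.

    Only benchmarks present in `scores` contribute, so unmeasured cells
    neither help nor hurt. Returns None when nothing was measured.
    """
    total = 0.0
    weight_sum = 0.0
    for name, value in scores.items():
        lo, hi, weight = BENCHMARKS[name]
        total += weight * normalise(value, lo, hi)
        weight_sum += weight
    return None if weight_sum == 0 else 100.0 * total / weight_sum


# Claude 3.5 Sonnet values from the table for these three columns.
claude_35_sonnet = {"SWE-bench Verified": 49.0, "Chatbot Arena ELO": 1285.0, "MT-Bench": 8.7}
composite = weighted_score(claude_35_sonnet)  # roughly 69.6 under these assumptions
```

Because the table is sparse, a composite of this kind compares models on different subsets of benchmarks, so it is best read alongside the count of benchmarks that contributed.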