AI Model Benchmarks

Compare 58 models across 34 benchmarks. Scores are sourced from official model cards, technical reports, and LMSYS Chatbot Arena.

Benchmarks: Standardised tests that measure specific AI capabilities: knowledge, reasoning, coding, maths, and human preference. Each benchmark tests a different skill.

Model cards: Technical documents published alongside a model release, detailing its capabilities, limitations, and benchmark results.

LMSYS Chatbot Arena: A crowdsourced platform where real users compare two anonymous AI models side by side and vote for the better one. With over 2 million votes, it is one of the most widely cited human preference benchmarks.
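The pairwise votes behind the Chatbot Arena ELO column are commonly turned into ratings with an Elo-style update. The sketch below shows the textbook online Elo rule only; the function names, the K-factor of 32, and the starting ratings are illustrative assumptions, and the Arena's own rating pipeline may be computed differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one vote. k is an assumed K-factor."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


# One vote between two equally rated (hypothetical) models.
new_a, new_b = update_elo(1200.0, 1200.0, a_won=True)
```

An upset (a lower-rated model winning) moves both ratings further than an expected result, which is why a single vote matters less once two models are far apart.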

External benchmark explorers and degradation trackers

Use these when you want deeper repeated-eval views, agent harness tracking, and historical performance beyond this site’s current benchmark matrix.

Factuality and hallucination

SimpleQA

OpenAI publication + open benchmark

Short-form factuality benchmark for fact-seeking questions with single verifiable answers. Useful for measuring whether a model answers accurately instead of confidently guessing.

Strong fit for the Reliability Floor factuality dimension, but too narrow to stand alone as a general reliability score.
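A minimal sketch of why abstention matters when scoring short-form factuality, assuming each answer has already been graded into one of three labels. The label names and the `summarize_grades` helper are hypothetical and not taken from the benchmark's code.

```python
from collections import Counter


def summarize_grades(grades):
    """Summarize three-way grades: 'correct', 'incorrect', 'not_attempted'."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly.
        "correct": counts["correct"] / total if total else 0.0,
        # Share of questions where the model declined to answer.
        "not_attempted": counts["not_attempted"] / total if total else 0.0,
        # Accuracy among attempted answers: penalizes confident wrong guesses.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }


summary = summarize_grades(["correct", "correct", "incorrect", "not_attempted"])
```

Reporting accuracy among attempted answers alongside overall accuracy separates a model that abstains when unsure from one that guesses.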

Grounded long-form factuality

FACTS Grounding

Google DeepMind benchmark + Kaggle leaderboard

Benchmark for whether long-form answers stay faithful to supplied documents while still addressing the user request.

Good fit for the Reliability Floor groundedness dimension because it tests source-grounded answers rather than broad world knowledge alone.

Instruction following

IFEval

Paper + Google Research code/data

Verifiable instruction-following benchmark covering constraints such as word counts, required keywords, and formatting rules.

Useful because it avoids purely subjective judging for a core reliability question: did the model do exactly what it was told?
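The constraint types named above can be checked with plain code rather than a judge model. The following is a rough sketch of that idea; the three checker functions are hypothetical and are not IFEval's actual implementation.

```python
import re


def check_max_words(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_keywords(response: str, keywords) -> bool:
    """True if every required keyword appears (case-insensitive)."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def check_bullet_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` lines starting with '-' or '*'."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+", ln)]
    return len(bullets) == expected


response = "- Benchmarks measure skills\n- Scores need provenance"
passed = (
    check_max_words(response, 10)
    and check_keywords(response, ["provenance", "benchmarks"])
    and check_bullet_count(response, 2)
)
```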

Long-context retrieval

RULER

Paper + open-source code

Long-context benchmark that extends needle-in-haystack into harder retrieval, multi-hop tracing, aggregation, and question-answering tasks.

Use as a retrieval stress test, not as proof of real-world long-context reliability by itself.
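For orientation, this is roughly what the basic needle-in-haystack setup looks like before it is extended into harder tasks. It is a simplified sketch: `build_haystack` and `retrieved` are hypothetical helpers, and the benchmark's real task generators are more elaborate.

```python
def build_haystack(filler_sentences, needle, depth, total_sentences):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Repeat the filler sentences until the requested length is reached.
    body = [filler_sentences[i % len(filler_sentences)] for i in range(total_sentences)]
    body.insert(round(depth * total_sentences), needle)
    return " ".join(body)


def retrieved(model_answer: str, expected: str) -> bool:
    """Loose string match: did the answer contain the hidden fact?"""
    return expected.lower() in model_answer.lower()


needle = "The passcode is 7421."
text = build_haystack(["The sky was grey."], needle, depth=0.5, total_sentences=4)
```

Sweeping `depth` and `total_sentences` gives the familiar position-by-length grid, which is the part a stress test covers and a real task does not.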

Real long-context tasks

HELMET

Hugging Face benchmark + paper

Holistic long-context evaluation suite with more application-like tasks including retrieval-augmented generation, citations, summarization, and reranking.

Useful counterweight to simple needle tests because the authors explicitly warn that simple synthetic retrieval does not predict real downstream performance well.

RAG factual QA

CRAG

NeurIPS paper + GitHub

Comprehensive RAG benchmark for dynamic factual question answering with web and knowledge-graph search simulations.

Best treated as a RAG-system reliability source, because scores depend on retrieval setup as well as the base model.

Long-term memory

LongMemEval

OpenReview paper

Benchmark for long-term assistant memory across sustained interactions, including extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

Useful for the Reliability Floor memory dimension, but memory claims need especially careful evidence labels and clearly stated limitations.

Fresh objective tasks

LiveBench

Leaderboard + paper + GitHub

Frequently updated objective benchmark covering math, coding, reasoning, language, instruction following, and data analysis.

Useful for fresh-data discipline and objective grading; only the relevant categories should contribute to the Reliability Floor.

Holistic model evaluation

HELM

Stanford CRFM benchmark + framework

Stanford CRFM evaluation framework and benchmark collection designed to make model comparisons more transparent across scenarios, metrics, prompts, and adapters.

Useful as a provenance model for the Reliability Floor because it treats prompt, harness, scenario, and metric choices as part of the result rather than invisible background detail.
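One way to picture that provenance idea is a result record that cannot exist without its setup. The sketch below is an assumption about how such a record might look; the `EvalResult` fields, the `comparable` helper, and all values are hypothetical and are not HELM's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """A score stored together with the choices that produced it."""
    model: str
    scenario: str
    metric: str
    score: float
    prompt_template: str
    harness: str


def comparable(a: EvalResult, b: EvalResult) -> bool:
    """Two results are comparable only if everything except the model matches."""
    return (a.scenario, a.metric, a.prompt_template, a.harness) == (
        b.scenario, b.metric, b.prompt_template, b.harness
    )


# Placeholder values for illustration only, not measured results.
r1 = EvalResult("model-a", "qa", "exact_match", 0.71, "zero-shot-v1", "harness-x")
r2 = EvalResult("model-b", "qa", "exact_match", 0.68, "zero-shot-v1", "harness-x")
r3 = EvalResult("model-b", "qa", "exact_match", 0.74, "five-shot-v1", "harness-x")
```

Here `r1` and `r2` can be ranked against each other, while `r3` used a different prompt setup and should be reported separately.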

Agent evals and degradation tracking

MarginLab

Public site + docs + GitHub

Open benchmark and eval ecosystem focused on robust, reproducible agent testing, with public degradation trackers and historical views for coding agents.

Useful when you want repeated agent and harness measurement rather than a single static benchmark snapshot.
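Repeated measurement makes it possible to separate a real drop from run-to-run noise. A minimal sketch of that comparison, assuming a handful of scores per period; the `degraded` helper, its thresholds, and the sample numbers are illustrative assumptions, not MarginLab's method or data.

```python
from statistics import mean, stdev


def degraded(baseline_runs, recent_runs, min_drop=2.0):
    """Flag a drop that exceeds both a fixed margin and twice the baseline noise."""
    drop = mean(baseline_runs) - mean(recent_runs)
    noise = stdev(baseline_runs) if len(baseline_runs) > 1 else 0.0
    return drop > max(min_drop, 2 * noise)


# Illustrative pass rates (percent) for the same agent and harness over time.
baseline = [70.0, 71.0, 69.0]
flag_drop = degraded(baseline, [64.0, 65.0, 66.0])
flag_stable = degraded(baseline, [69.5, 70.5, 70.0])
```

A single static snapshot cannot support this check, because it provides no estimate of the noise term.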

58 models · 34 benchmarks
Columns, in order: Model, Provider, GAIA, TAU-bench, WebArena, Aider Polyglot, BFCL, BigCodeBench, HumanEval, LiveCodeBench, SWE-bench Verified, Arena-Hard, Chatbot Arena ELO, MT-Bench, Creative Writing Bench, FinQA, FinanceBench, LegalBench, MedQA, WildBench Creative, IFEval, MMLU, MMLU-Pro, MGSM, MMMU, MMMU-Pro, MathVista, AIME 2025, ARC Challenge, GPQA Diamond, Humanity's Last Exam, LiveBench, MATH-500, AIR-Bench, SimpleQA, TrustLLM

Each row lists the model, its provider, and its measured scores in column order; unmeasured cells are omitted.

Claude 3 Opus | Anthropic | 84.9, 1253, 86.8, 50.4, 60.1
Claude 3.5 Haiku | Anthropic | 88.1, 1260, 83.0, 69.3
Claude 3.5 Sonnet | Anthropic | 93.7, 49.0, 1285, 8.7, 88.7, 65.0, 78.3
Claude 3.7 Sonnet | Anthropic | 62.3, 1310, 68.0, 86.0
Claude Haiku 4.5 | Anthropic | 90.0, 52.0, 1295
Claude Opus 4 | Anthropic | 62.0, 38.0, 95.0, 72.5, 1330, 88.0, 78.0, 74.0, 82.0, 89.0, 84.0, 89.0, 90.0, 64.0, 72.1, 88.7, 44.0
Claude Opus 4.5 | Anthropic | 95.5, 75.0, 1355, 78.0
Claude Opus 4.6 | Anthropic | 75.0, 70.0, 48.0, 82.0, 72.0, 78.0, 90.0, 1365, 9.4, 92.0, 83.0, 80.0, 86.0, 88.0, 22.0, 86.5
Claude Sonnet 4 | Anthropic | 88.0, 93.0, 53.6, 1310, 8.8, 86.0, 74.0, 79.0, 82.0, 88.0, 67.5, 85.4, 41.0
Claude Sonnet 4.5 | Anthropic | 68.0, 1340, 72.0
Claude Sonnet 4.6 | Anthropic | 79.0, 72.0, 86.0, 1350, 83.0
Command A | Cohere | 85.0, 1280, 83.5
DeepSeek V3 | DeepSeek | 89.5, 1275, 88.5, 59.1, 78.3
DeepSeek V3 0324 | DeepSeek | 91.0, 50.0, 1310, 89.5, 62.0, 84.0
Gemini 1.5 Pro | Google | 84.1, 1260, 85.9, 67.7
Gemini 2.0 Flash | Google | 1270
Gemini 2.5 Flash | Google | 88.5, 1300, 86.5, 59.2, 82.3
Gemini 2.5 Flash Lite | Google | 82.0, 1240, 80.0
Gemini 2.5 Pro Preview 06-05 | Google | 68.0, 64.0, 42.0, 72.0, 87.0, 68.0, 93.2, 63.8, 85.5, 1340, 9.2, 85.0, 80.0, 77.0, 84.0, 91.0, 82.0, 90.5, 92.0, 68.0, 86.7, 68.4, 82.0, 90.2, 47.0
Gemini 3 Flash Preview | Google | 89.0, 55.0, 1310
Gemini 3 Pro | Google | 70.0, 1362, 80.0
Gemini 3.1 Pro Preview | Google | 1375, 25.0
GPT-4.1 | OpenAI | 93.4, 54.6, 1283, 90.2, 66.3, 83.0
GPT-4.1 Mini | OpenAI | 90.5, 42.0, 1275, 87.5
GPT-4.5 | OpenAI | 91.0, 1300, 8.8, 90.2, 62.0, 82.0, 62.5
GPT-4o (2024-05-13) | OpenAI | 52.0, 30.0, 85.0, 90.2, 1285, 8.6, 82.0, 72.0, 68.0, 78.0, 86.1, 78.0, 88.7, 86.5, 51.0, 53.6, 76.6, 38.2
GPT-4o-mini | OpenAI | 87.2, 1240, 82.0, 70.2
GPT-5 | OpenAI | 75.0, 1355, 86.0, 52.0
GPT-5 Pro | OpenAI | 76.5, 1360, 88.0, 97.5
GPT-5.2 | OpenAI | 78.0, 72.0, 52.0, 80.0, 92.0, 73.0, 78.0, 92.0, 1370, 9.5, 90.0, 85.0, 82.0, 88.0, 94.0, 86.0, 95.0, 72.0, 89.0, 24.0, 88.0, 58.0
GPT-5.2 Pro | OpenAI | 80.0, 1380, 91.0, 98.5
Grok 2 | xAI | 1270, 87.5
Grok 3 | xAI | 82.0, 93.8, 1329, 91.0, 83.9, 68.2, 91.5
Grok 3 Mini | xAI | 88.0, 1305, 85.0
Grok 4 | xAI | 70.0, 66.0, 40.0, 89.0, 1345, 9.3, 84.0, 79.0, 76.0, 83.0, 80.0, 82.0, 21.0, 84.0, 95.0
Grok 4 Fast | xAI | 90.0, 1310
Llama 3.1 405B | Meta | 89.0, 1253, 88.6, 51.1
Llama 3.3 70B Instruct | Meta | 88.4, 1250, 86.0, 50.7
Llama 4 Maverick | Meta | 87.5, 1290, 8.5, 78.0, 68.0, 72.0, 85.5, 84.0, 56.0, 81.5
Llama 4 Scout | Meta | 85.0, 1275, 83.0, 76.8
Mistral Large | Mistral | 89.5, 45.0, 1295, 80.0, 87.0, 58.0
Mistral Small 3.1 24B | Mistral | 80.0, 1235, 78.0
Nemotron 70B | NVIDIA | 1265, 85.0
o1 | OpenAI | 94.0, 48.9, 1335, 91.8, 78.0, 96.4
o1-mini | OpenAI | 92.4, 60.0, 90.0
o3 | OpenAI | 72.0, 68.0, 45.0, 76.0, 74.0, 97.0, 69.1, 88.5, 1337, 9.1, 82.0, 79.0, 85.0, 92.0, 92.0, 93.0, 91.6, 83.3, 85.0, 96.7, 49.0
o3 Mini | OpenAI | 94.5, 1310, 75.0, 94.0
o3 Pro | OpenAI | 73.0, 96.7, 87.5, 26.6, 98.0
o4 Mini | OpenAI | 96.0, 92.7, 81.4, 96.3
Phi 4 | Microsoft | 82.6, 84.8, 56.1, 80.4
Phi-4 Reasoning | Microsoft | 89.0, 75.3, 65.8, 94.3
Qwen2.5 72B Instruct | Alibaba | 86.6, 1245, 86.1, 49.0, 80.0
Qwen2.5 Coder 32B Instruct | Alibaba | 73.7, 92.7
Qwen3 235B A22B | Alibaba | 1320, 78.0, 92.0
Qwen3 Max | Alibaba | 93.0, 1335, 72.0, 93.5
QwQ 32B | Alibaba | 88.0, 79.5, 63.0, 90.6
R1 | DeepSeek | 55.0, 28.0, 65.0, 65.0, 92.5, 82.0, 1318, 8.9, 72.0, 76.0, 72.0, 76.0, 68.0, 90.8, 79.8, 71.5, 18.0, 80.0, 97.3
R1 0528 | DeepSeek | 94.0, 57.6, 1328, 87.5, 76.0, 97.8

★ = best score in column. — = not yet measured. Scores from official reports and LMSYS Chatbot Arena.
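The legend implies two rules: unmeasured cells must not count as zero, and the star goes to the highest measured value in a column. A small sketch of that logic follows; the `best_in_column` helper and the placeholder model names and values are assumptions for illustration, not data from the table.

```python
from typing import Dict, List, Optional


def best_in_column(scores: Dict[str, Optional[float]]) -> List[str]:
    """Return the model(s) with the highest measured score; None means unmeasured."""
    measured = {model: s for model, s in scores.items() if s is not None}
    if not measured:
        return []  # nothing measured, so no star in this column
    top = max(measured.values())
    # Ties share the star.
    return [model for model, s in measured.items() if s == top]


# Placeholder column: model-b has not been measured on this benchmark.
column = {"model-a": 88.0, "model-b": None, "model-c": 91.5}
starred = best_in_column(column)
```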

Benchmark Descriptions

GAIA (agent)

General AI Assistant tasks requiring web browsing, reasoning, and tool use

Scale: 0–100 · Weight: 1.3x
TAU-bench (agent)

Tool-Agent-User interaction quality across multi-step scenarios

Scale: 0–100 · Weight: 1x
WebArena (agent)

Autonomous web navigation and task completion in realistic environments

Scale: 0–100 · Weight: 1.1x
Aider Polyglot (coding)

Multi-language coding: 225 exercises across C++, Go, Java, JavaScript, Python, and Rust

Scale: 0–100 · Weight: 1.1x
BFCL (coding)

Berkeley Function Calling Leaderboard: tool and function invocation accuracy

Scale: 0–100 · Weight: 1x
BigCodeBench (coding)

Challenging code generation tasks with complex function calls and libraries

Scale: 0–100 · Weight: 1.1x
HumanEval (coding)

Python function completion (pass@1)

Scale: 0–100 · Weight: 1x
LiveCodeBench (coding)

Competitive programming problems from live contests

Scale: 0–100 · Weight: 1.2x
SWE-bench Verified (coding)

Real-world GitHub issue resolution

Scale: 0–100 · Weight: 1.5x
Arena-Hard (conversational)

Automated benchmark that uses GPT-4 as the judge on challenging Arena questions

Scale: 0–100 · Weight: 1.2x
Chatbot Arena ELO (conversational)

LMSYS Chatbot Arena crowdsourced Elo rating

Scale: 800–1400 · Weight: 1.5x
MT-Bench (conversational)

Multi-turn conversation benchmark judged by GPT-4

Scale: 0–10 · Weight: 0.8x
Creative Writing Bench

Expert-judged creative writing quality across fiction, poetry, and narrative tasks

Scale: 0–100 · Weight: 0.8x
FinQA (domain)

Financial question answering over earnings reports: numerical reasoning on real SEC filings

Scale: 0–100 · Weight: 0.8x
FinanceBench (domain)

Open-ended financial analysis: 150 questions over 10-K and 10-Q filings

Scale: 0–100 · Weight: 0.8x
LegalBench (domain)

Legal reasoning across 162 tasks: issue-spotting, rule-recall, interpretation

Scale: 0–100 · Weight: 0.8x
MedQA (domain)

US Medical Licensing Exam questions: medical knowledge and reasoning

Scale: 0–100 · Weight: 0.8x
WildBench Creative

Creative subset of WildBench: real user creative writing prompts judged by GPT-4

Scale: 0–100 · Weight: 0.8x
IFEval (instruction)

Instruction Following Evaluation: verifiable constraints such as word counts, required keywords, and formatting rules

Scale: 0–100 · Weight: 0.8x
MMLU (knowledge)

Massive Multitask Language Understanding: 57 academic subjects

Scale: 0–100 · Weight: 0.8x
MMLU-Pro (knowledge)

Harder version of MMLU with 10 answer choices

Scale: 0–100 · Weight: 1.2x
MGSM (multilingual)

Multilingual Grade School Math: 250 problems in 10 languages

Scale: 0–100 · Weight: 0.7x
MMMU (multimodal)

Massive Multi-discipline Multimodal Understanding

Scale: 0–100 · Weight: 1x
MMMU-Pro (multimodal)

Enhanced multimodal understanding: harder than MMMU and designed to resist shortcut strategies

Scale: 0–100 · Weight: 1.2x
MathVista (multimodal)

Mathematical reasoning in visual contexts: diagrams, charts, figures

Scale: 0–100 · Weight: 1x
AIME 2025 (reasoning)

American Invitational Mathematics Examination 2025

Scale: 0–100 · Weight: 1.4x
ARC Challenge (reasoning)

AI2 Reasoning Challenge: grade school science

Scale: 0–100 · Weight: 0.6x
GPQA Diamond (reasoning)

Graduate-level science questions, expert-validated

Scale: 0–100 · Weight: 1.5x
Humanity's Last Exam (reasoning)

Ultra-hard questions from experts across 100+ academic subjects

Scale: 0–100 · Weight: 1.5x
LiveBench (reasoning)

Contamination-free benchmark using recent questions from math, coding, and reasoning

Scale: 0–100 · Weight: 1.3x
MATH-500 (reasoning)

Competition-level mathematics problems

Scale: 0–100 · Weight: 1.3x
AIR-Bench (safety)

AI safety aligned with regulations: 5,694 tests across 314 risk categories

Scale: 0–100 · Weight: 0.8x
SimpleQA (safety)

Factual accuracy on straightforward questions, used as a measure of hallucination

Scale: 0–100 · Weight: 1x
TrustLLM (safety)

Comprehensive trustworthiness: truthfulness, safety, fairness, robustness, privacy

Scale: 0–100 · Weight: 0.8x
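The descriptions above give each benchmark a scale and a weight but do not say how the two are combined. One plausible reading, offered purely as an assumption, is a weighted mean of scores rescaled to 0–1, taken over measured benchmarks only. The helper names and the sample scores below are hypothetical.

```python
def normalize(score, low, high):
    """Rescale a raw score from its benchmark scale [low, high] to [0, 1]."""
    return (score - low) / (high - low)


def weighted_composite(entries):
    """Weighted mean of normalized scores.

    entries: iterable of (score_or_None, low, high, weight).
    Unmeasured benchmarks (score is None) are skipped, not counted as zero.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for score, low, high, weight in entries:
        if score is None:
            continue
        weighted_sum += weight * normalize(score, low, high)
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# Illustrative scores on a 0-100 scale, an 800-1400 scale, and an unmeasured 0-10 scale.
composite = weighted_composite([
    (50.0, 0, 100, 1.0),
    (1100.0, 800, 1400, 1.5),
    (None, 0, 10, 0.8),
])
```

Rescaling first matters because a raw 1100 on the 800–1400 scale would otherwise swamp every 0–100 score regardless of weight.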