MMLU
Category: knowledge

MMLU (Massive Multitask Language Understanding) tests a model across 57 academic subjects including STEM, humanities, social sciences, and more. It measures breadth of knowledge from elementary to professional level.
View paper / source

Models Tested: 13
Best Score: 92.0
Average Score: 88.9
Scale Range: 0–100
Weight: 0.8x
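If the "Weight: 0.8x" stat means MMLU's score is multiplied by 0.8 when rolled into an overall weighted average across benchmarks (an assumption about this page's aggregation, not a documented formula), the arithmetic would look like this minimal sketch; the second benchmark score is invented for illustration.

```python
# Hypothetical aggregation sketch. ASSUMPTION: "0.8x Weight" means MMLU
# contributes with relative weight 0.8 in a weighted mean over benchmarks.
# The 80.0 score for a second benchmark is made up for the example.

def weighted_average(score_weight_pairs):
    """Weighted mean of (score, weight) pairs."""
    total_weight = sum(w for _, w in score_weight_pairs)
    return sum(s * w for s, w in score_weight_pairs) / total_weight

overall = weighted_average([(92.0, 0.8),   # MMLU best score, weight 0.8x
                            (80.0, 1.0)])  # hypothetical second benchmark
```

Under this reading, a strong MMLU score moves the aggregate less than an equally strong score on a full-weight benchmark.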
How It Works
The model is given multiple-choice questions (4 options) across 57 subjects. Questions range from elementary mathematics to professional medicine and law. The test uses a few-shot format where the model sees examples before answering.
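The few-shot format described above can be sketched as follows: each solved example ends with its answer letter, the target question ends with a bare "Answer:", and scoring is simple letter-match accuracy. The sample items and field names below are invented for illustration, not real MMLU data or a real harness API.

```python
# Sketch of MMLU-style evaluation: 4-option multiple choice, presented
# few-shot (solved examples first, then the target question).
# Items here are invented; real MMLU questions span 57 subjects.

CHOICES = "ABCD"

def format_question(item, include_answer):
    """Render one question in the 'A./B./C./D. ... Answer:' layout."""
    lines = [item["question"]]
    lines += [f"{letter}. {option}"
              for letter, option in zip(CHOICES, item["options"])]
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(few_shot_items, target_item):
    """Few-shot prompt: examples carry their answers; the target does not."""
    blocks = [format_question(ex, include_answer=True) for ex in few_shot_items]
    blocks.append(format_question(target_item, include_answer=False))
    return "\n\n".join(blocks)

def accuracy(predicted_letters, items):
    """Percent of predicted answer letters matching the gold answers."""
    correct = sum(p == item["answer"]
                  for p, item in zip(predicted_letters, items))
    return 100.0 * correct / len(items)

# Usage with two toy items:
example = {"question": "What is 2 + 2?",
           "options": ["3", "4", "5", "6"], "answer": "B"}
target = {"question": "Which planet is largest?",
          "options": ["Mars", "Venus", "Jupiter", "Earth"], "answer": "C"}
prompt = build_prompt([example], target)
```

The model's completion after the final "Answer:" is compared to the gold letter; the reported MMLU score is this accuracy over all questions.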
Why It Matters
MMLU is one of the most widely-cited benchmarks because it tests general knowledge breadth. A model that scores well on MMLU demonstrates broad competence across many domains, making it a useful proxy for general intelligence.
Limitations
MMLU relies on a multiple-choice format, which can be gamed. Some questions are ambiguous or have contested answers, and the test rewards recall more than reasoning. Many modern models now saturate the benchmark (scores above 90%), reducing its discriminative power.
Leaderboard — MMLU
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 92.0 |
| 🥈 | Grok 3 Beta | xAI | 91.0 |
| 🥉 | R1 | DeepSeek | 90.8 |
| 4 | Gemini 2.5 Pro Preview 06-05 | Google | 90.5 |
| 5 | GPT-4.1 | OpenAI | 90.2 |
| 6 | DeepSeek V3 0324 | DeepSeek | 89.5 |
| 7 | Claude Opus 4 | Anthropic | 89.0 |
| 8 | GPT-4o (2024-05-13) | OpenAI | 88.7 |
| 9 | DeepSeek V3 | DeepSeek | 88.5 |
| 10 | Claude Sonnet 4 | Anthropic | 88.0 |
| 11 | Gemini 2.5 Flash | Google | 86.5 |
| 12 | Qwen2.5 72B Instruct | Alibaba | 86.1 |
| 13 | Llama 4 Maverick | Meta | 85.5 |
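The summary statistics at the top of the page follow directly from the leaderboard: the best score is the top entry, and the average is the mean over all 13 models. A quick check:

```python
# Recompute the page's summary stats from the 13 leaderboard scores.

scores = [92.0, 91.0, 90.8, 90.5, 90.2, 89.5, 89.0,
          88.7, 88.5, 88.0, 86.5, 86.1, 85.5]

best = max(scores)                             # o3's score
average = round(sum(scores) / len(scores), 1)  # mean across 13 models
```

This reproduces the reported Best Score of 92.0 and Average Score of 88.9.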