MMLU

Category: Knowledge

MMLU (Massive Multitask Language Understanding) tests a model across 57 academic subjects including STEM, humanities, social sciences, and more. It measures breadth of knowledge from elementary to professional level.


Models Tested: 13
Best Score: 92.0
Average Score: 88.9
Scale Range: 0–100
Weight: 0.8x
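If the 0.8x weight is used to fold MMLU into a composite leaderboard score, one simple interpretation is a weighted average over per-benchmark scores. The sketch below assumes that aggregation scheme; the function name, companion benchmark, and numbers are illustrative, not taken from the source.

```python
# Hypothetical sketch: combining 0-100 benchmark scores with per-benchmark
# weights (e.g. MMLU weighted at 0.8). The actual aggregation used by the
# leaderboard is not specified on this page.

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores; MMLU contributes with weight 0.8."""
    total_weight = sum(weights[name] for name in scores)
    return sum(value * weights[name] for name, value in scores.items()) / total_weight

# Made-up example with a second benchmark for illustration only.
print(composite_score({"MMLU": 92.0, "GPQA": 80.0}, {"MMLU": 0.8, "GPQA": 1.0}))
```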

How It Works

The model answers four-option multiple-choice questions drawn from 57 subjects, ranging from elementary mathematics to professional medicine and law. Evaluation uses a few-shot format (typically 5-shot): the model sees several solved examples from the same subject before answering the test question.
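To make the protocol concrete, here is a minimal sketch of building a few-shot MMLU-style prompt and scoring answers by exact letter match. The prompt wording, helper names, and example items are assumptions for illustration; harnesses differ in the exact phrasing and number of shots.

```python
# Minimal sketch of an MMLU-style few-shot prompt and accuracy scoring.
# Assumptions: four lettered choices (A-D), solved examples prepended to the
# test question, and the predicted answer reported as a single letter.

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_item(question: str, choices: list[str], answer: str | None = None) -> str:
    """Render one multiple-choice item; include the answer only for solved examples."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICE_LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject: str, shots: list[dict], target: dict) -> str:
    """Few-shot prompt: k solved examples from the subject, then the unsolved item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n"
    examples = [format_item(s["question"], s["choices"], s["answer"]) for s in shots]
    return "\n\n".join([header] + examples + [format_item(target["question"], target["choices"])])

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """MMLU score: percentage of questions where the predicted letter matches the key."""
    correct = sum(p.strip().upper().startswith(g) for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Made-up items, purely to show the prompt shape.
shots = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
target = {"question": "What is 7 x 6?", "choices": ["42", "48", "36", "54"]}
print(build_prompt("elementary mathematics", shots, target))
```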

Why It Matters

MMLU is one of the most widely cited benchmarks because it tests breadth of general knowledge. A model that scores well on MMLU demonstrates broad competence across many domains, making it a useful proxy for general capability.

Limitations

MMLU relies on a multiple-choice format, which can be gamed. Some questions are ambiguous or have contested answers, and the benchmark tests recall more than reasoning. Many modern models now saturate it (scoring above 90%), which reduces its discriminative power.

Leaderboard — MMLU

Rank | Model                         | Provider  | Score
-----|-------------------------------|-----------|------
1    | o3                            | OpenAI    | 92.0
2    | Grok 3 Beta                   | xAI       | 91.0
3    | R1                            | DeepSeek  | 90.8
4    | Gemini 2.5 Pro Preview 06-05  | Google    | 90.5
5    | GPT-4.1                       | OpenAI    | 90.2
6    | DeepSeek V3 0324              | DeepSeek  | 89.5
7    | Claude Opus 4                 | Anthropic | 89.0
8    | GPT-4o (2024-05-13)           | OpenAI    | 88.7
9    | DeepSeek V3                   | DeepSeek  | 88.5
10   | Claude Sonnet 4               | Anthropic | 88.0
11   | Gemini 2.5 Flash              | Google    | 86.5
12   | Qwen2.5 72B Instruct          | Alibaba   | 86.1
13   | Llama 4 Maverick              | Meta      | 85.5