MMLU
Category: knowledge

MMLU (Massive Multitask Language Understanding) tests a model across 57 academic subjects including STEM, humanities, social sciences, and more. It measures breadth of knowledge from elementary to professional level.
View paper / source

Models Tested: 13
Best Score: 92.0
Average Score: 88.9
Scale Range: 0–100
Weight: 0.8x
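If the "Weight: 0.8x" stat means MMLU's score is multiplied by 0.8 when rolled into an overall weighted average across benchmarks (an assumption about this page's aggregation, not a documented formula), the arithmetic would look like this minimal sketch; the second benchmark score is invented for illustration.

```python
# Hypothetical aggregation sketch. ASSUMPTION: "0.8x Weight" means MMLU
# contributes with relative weight 0.8 in a weighted mean over benchmarks.
# The 80.0 score for a second benchmark is made up for the example.

def weighted_average(score_weight_pairs):
    """Weighted mean of (score, weight) pairs."""
    total_weight = sum(w for _, w in score_weight_pairs)
    return sum(s * w for s, w in score_weight_pairs) / total_weight

overall = weighted_average([(92.0, 0.8),   # MMLU best score, weight 0.8x
                            (80.0, 1.0)])  # hypothetical second benchmark
```

Under this reading, a strong MMLU score moves the aggregate less than an equally strong score on a full-weight benchmark.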
How It Works
The model is given multiple-choice questions (4 options) across 57 subjects. Questions range from elementary mathematics to professional medicine and law. The test uses a few-shot format where the model sees examples before answering.
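The few-shot format described above can be sketched as follows: each solved example ends with its answer letter, the target question ends with a bare "Answer:", and scoring is simple letter-match accuracy. The sample items and field names below are invented for illustration, not real MMLU data or a real harness API.

```python
# Sketch of MMLU-style evaluation: 4-option multiple choice, presented
# few-shot (solved examples first, then the target question).
# Items here are invented; real MMLU questions span 57 subjects.

CHOICES = "ABCD"

def format_question(item, include_answer):
    """Render one question in the 'A./B./C./D. ... Answer:' layout."""
    lines = [item["question"]]
    lines += [f"{letter}. {option}"
              for letter, option in zip(CHOICES, item["options"])]
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(few_shot_items, target_item):
    """Few-shot prompt: examples carry their answers; the target does not."""
    blocks = [format_question(ex, include_answer=True) for ex in few_shot_items]
    blocks.append(format_question(target_item, include_answer=False))
    return "\n\n".join(blocks)

def accuracy(predicted_letters, items):
    """Percent of predicted answer letters matching the gold answers."""
    correct = sum(p == item["answer"]
                  for p, item in zip(predicted_letters, items))
    return 100.0 * correct / len(items)

# Usage with two toy items:
example = {"question": "What is 2 + 2?",
           "options": ["3", "4", "5", "6"], "answer": "B"}
target = {"question": "Which planet is largest?",
          "options": ["Mars", "Venus", "Jupiter", "Earth"], "answer": "C"}
prompt = build_prompt([example], target)
```

The model's completion after the final "Answer:" is compared to the gold letter; the reported MMLU score is this accuracy over all questions.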
Why It Matters
MMLU is one of the most widely-cited benchmarks because it tests general knowledge breadth. A model that scores well on MMLU demonstrates broad competence across many domains, making it a useful proxy for general intelligence.
Limitations
MMLU relies on a multiple-choice format, which can be gamed. Some questions are ambiguous or have contested answers, and the test rewards recall more than reasoning. Many modern models now saturate the benchmark (scores above 90%), reducing its discriminative power.
Leaderboard — MMLU
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 92.0 |
| 🥈 | Grok 3 Beta | xAI | 91.0 |
| 🥉 | R1 | DeepSeek | 90.8 |
| 4 | Gemini 2.5 Pro Preview 06-05 | Google | 90.5 |
| 5 | GPT-4.1 | OpenAI | 90.2 |
| 6 | DeepSeek V3 0324 | DeepSeek | 89.5 |
| 7 | Claude Opus 4 | Anthropic | 89.0 |
| 8 | GPT-4o (2024-05-13) | OpenAI | 88.7 |
| 9 | DeepSeek V3 | DeepSeek | 88.5 |
| 10 | Claude Sonnet 4 | Anthropic | 88.0 |
| 11 | Gemini 2.5 Flash | Google | 86.5 |
| 12 | Qwen2.5 72B Instruct | Alibaba | 86.1 |
| 13 | Llama 4 Maverick | Meta | 85.5 |
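The summary statistics at the top of the page follow directly from the leaderboard: the best score is the top entry, and the average is the mean over all 13 models. A quick check:

```python
# Recompute the page's summary stats from the 13 leaderboard scores.

scores = [92.0, 91.0, 90.8, 90.5, 90.2, 89.5, 89.0,
          88.7, 88.5, 88.0, 86.5, 86.1, 85.5]

best = max(scores)                             # o3's score
average = round(sum(scores) / len(scores), 1)  # mean across 13 models
```

This reproduces the reported Best Score of 92.0 and Average Score of 88.9.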