Humanity's Last Exam

Category: reasoning

Humanity's Last Exam is an ultra-hard benchmark built from questions contributed by 100+ experts across academic disciplines. The questions are difficult enough to represent the frontier of human knowledge.

- Models Tested: 6
- Best Score: 26.6
- Average Score: 22.8
- Scale Range: 0–100
- Weight: 1.5x (see the aggregation sketch below)
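
The 1.5x weight suggests this benchmark counts more heavily in the site's overall model index. Below is a minimal sketch of one plausible scheme, a weighted average across benchmarks; the aggregate_score function, the formula, and the second benchmark's numbers are illustrative assumptions, not the site's documented methodology.

```python
# Minimal sketch of how a per-benchmark weight could feed an aggregate
# model index. The weighted-average scheme and the "other_benchmark"
# numbers are assumptions for illustration, not documented behavior.

def aggregate_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores, all on a 0-100 scale."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight

# Example: HLE's best score (26.6) weighted 1.5x against a hypothetical
# benchmark weighted 1.0x.
scores = {"humanitys_last_exam": 26.6, "other_benchmark": 80.0}
weights = {"humanitys_last_exam": 1.5, "other_benchmark": 1.0}
print(round(aggregate_score(scores, weights), 1))  # 48.0
```

Under this scheme, the 1.5x weight pulls the index toward the HLE score without changing the 0–100 scale of either benchmark.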

How It Works

Experts in fields such as quantum physics, advanced mathematics, constitutional law, and ancient history contribute questions at the very limit of human expertise. Models answer in a free-form format, and responses are checked with expert validation.
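
As a rough illustration of free-form grading, the sketch below normalizes a model's answer and compares it to a reference string. This is a deliberate simplification assuming short factual answers; the benchmark itself relies on expert validation rather than string matching, and the normalize/grade helpers here are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return re.sub(r"\s+", " ", text)     # collapse runs of whitespace

def grade(prediction: str, reference: str) -> bool:
    """Count a free-form answer as correct if it matches the reference
    after normalization. Real grading is more lenient and expert-driven."""
    return normalize(prediction) == normalize(reference)

assert grade("  The Riemann Hypothesis. ", "the riemann hypothesis")
```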

Why It Matters

As AI models saturate existing benchmarks, harder tests are needed. Humanity's Last Exam measures whether models can match the very best human experts. Current top scores sit around 25%, leaving significant room for improvement.

Limitations

Each domain contributes only a small set of questions, and experts sometimes disagree on the correct answers. Some questions may reward niche knowledge more than general intelligence. And with top scores around 25%, most score differences are within noise.
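
To see why low scores over small question sets make comparisons noisy, treat each question as an independent Bernoulli trial: the standard error of an accuracy estimate p over n questions is sqrt(p * (1 - p) / n). The sketch below applies this formula; the 200-question count is a hypothetical, not a figure from the benchmark.

```python
import math

def score_stderr(p: float, n: int) -> float:
    """Standard error of a binomial accuracy estimate, in points on a 0-100 scale."""
    return 100 * math.sqrt(p * (1 - p) / n)

# At ~25% accuracy over a hypothetical 200-question slice:
print(round(score_stderr(0.25, 200), 1))  # ~3.1 points
```

A standard error near 3 points is larger than the gap between most adjacent entries on the leaderboard below, so small score differences should be read cautiously.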

Leaderboard — Humanity's Last Exam

#   Model                    Provider    Score
1   o3 Pro                   OpenAI      26.6
2   Gemini 3.1 Pro Preview   Google      25.0
3   GPT-5.2                  OpenAI      24.0
4   Claude Opus 4.6          Anthropic   22.0
5   Grok 4                   xAI         21.0
6   R1                       DeepSeek    18.0