Humanity's Last Exam

Category: reasoning

Humanity's Last Exam is an ultra-hard benchmark built from questions contributed by 100+ experts across academic disciplines. The questions are difficult enough to represent the frontier of human knowledge.

- Models Tested: 6
- Best Score: 26.6
- Average Score: 22.8
- Scale Range: 0–100
- Weight: 1.5x (see the aggregation sketch below)
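
The 1.5x weight suggests this benchmark counts more heavily in the site's overall model index. Below is a minimal sketch of one plausible scheme, a weighted average across benchmarks; the aggregate_score function, the formula, and the second benchmark's numbers are illustrative assumptions, not the site's documented methodology.

```python
# Minimal sketch of how a per-benchmark weight could feed an aggregate
# model index. The weighted-average scheme and the "other_benchmark"
# numbers are assumptions for illustration, not documented behavior.

def aggregate_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores, all on a 0-100 scale."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight

# Example: HLE's best score (26.6) weighted 1.5x against a hypothetical
# benchmark weighted 1.0x.
scores = {"humanitys_last_exam": 26.6, "other_benchmark": 80.0}
weights = {"humanitys_last_exam": 1.5, "other_benchmark": 1.0}
print(round(aggregate_score(scores, weights), 1))  # 48.0
```

Under this scheme, the 1.5x weight pulls the index toward the HLE score without changing the 0–100 scale of either benchmark.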

How It Works

Experts in fields such as quantum physics, advanced mathematics, constitutional law, and ancient history contribute questions at the very limit of human expertise. Models answer in a free-form format, and responses are checked with expert validation.
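
As a rough illustration of free-form grading, the sketch below normalizes a model's answer and compares it to a reference string. This is a deliberate simplification assuming short factual answers; the benchmark itself relies on expert validation rather than string matching, and the normalize/grade helpers here are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return re.sub(r"\s+", " ", text)     # collapse runs of whitespace

def grade(prediction: str, reference: str) -> bool:
    """Count a free-form answer as correct if it matches the reference
    after normalization. Real grading is more lenient and expert-driven."""
    return normalize(prediction) == normalize(reference)

assert grade("  The Riemann Hypothesis. ", "the riemann hypothesis")
```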

Why It Matters

As AI models saturate existing benchmarks, harder tests are needed. Humanity's Last Exam measures whether models can match the very best human experts. Current top scores sit around 25%, leaving significant room for improvement.

Limitations

Each domain contributes only a small set of questions, and experts sometimes disagree on the correct answers. Some questions may reward niche knowledge more than general intelligence. And with top scores around 25%, most score differences are within noise.
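
To see why low scores over small question sets make comparisons noisy, treat each question as an independent Bernoulli trial: the standard error of an accuracy estimate p over n questions is sqrt(p * (1 - p) / n). The sketch below applies this formula; the 200-question count is a hypothetical, not a figure from the benchmark.

```python
import math

def score_stderr(p: float, n: int) -> float:
    """Standard error of a binomial accuracy estimate, in points on a 0-100 scale."""
    return 100 * math.sqrt(p * (1 - p) / n)

# At ~25% accuracy over a hypothetical 200-question slice:
print(round(score_stderr(0.25, 200), 1))  # ~3.1 points
```

A standard error near 3 points is larger than the gap between most adjacent entries on the leaderboard below, so small score differences should be read cautiously.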

Leaderboard — Humanity's Last Exam

#   Model                    Provider    Score
1   o3 Pro                   OpenAI      26.6
2   Gemini 3.1 Pro Preview   Google      25.0
3   GPT-5.2                  OpenAI      24.0
4   Claude Opus 4.6          Anthropic   22.0
5   Grok 4                   xAI         21.0
6   R1                       DeepSeek    18.0