Humanity's Last Exam
reasoning
Humanity's Last Exam is an ultra-hard benchmark containing questions from 100+ experts across every academic discipline, questions so difficult that they represent the frontier of human knowledge.
Models Tested: 6
Best Score: 26.6
Average Score: 22.8
Scale Range: 0–100
Weight: 1.5x
How It Works
Experts from fields like quantum physics, advanced mathematics, constitutional law, and ancient history contribute questions that are at the very limit of human expertise. Models answer in a free-form format with expert validation.
Why It Matters
As AI models saturate existing benchmarks, harder tests are needed. Humanity's Last Exam tests whether models can match the very best human experts. Current top scores are around 25% (the best recorded here is 26.6 on a 0–100 scale), which leaves significant room for improvement.
Limitations
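Each domain contributes only a very small set of questions, and experts sometimes disagree on the correct answer. Some questions may test niche knowledge more than general intelligence. With top scores around 25% and few questions per domain, small differences between models may reflect noise rather than real differences in capability.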
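To make the noise concern concrete, here is a minimal sketch that treats a score as a simple binomial proportion and estimates its standard error. The question counts (100 and 2500) are hypothetical, since this page does not state how many questions are scored. The function names are illustrative, and the two scores are this page's best and average figures, used here as stand-ins for two hypothetical models. The comparison also assumes the two scores are independent samples, which ignores that models are graded on the same questions.

```python
import math


def score_standard_error(score_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


def gap_is_distinguishable(score_a: float, score_b: float,
                           n_questions: int, z: float = 1.96) -> bool:
    """Rough check: is the gap larger than z standard errors of the difference?

    Assumes both scores come from n_questions independent pass/fail items
    and that the two scores are independent of each other.
    """
    se_a = score_standard_error(score_a, n_questions)
    se_b = score_standard_error(score_b, n_questions)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(score_a - score_b) > z * se_diff


# Hypothetical question counts; the scores are the best and average on this page.
small_set = gap_is_distinguishable(26.6, 22.8, n_questions=100)    # False
large_set = gap_is_distinguishable(26.6, 22.8, n_questions=2500)   # True
```

Under these assumptions, a 3.8-point gap is well inside the noise on 100 questions but clears it on 2500, which is why per-domain comparisons on small subsets deserve extra caution.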