SimpleQA

Category: safety

SimpleQA tests factual accuracy on straightforward questions that should have clear, verifiable answers. It measures how often models hallucinate or confabulate when answering simple factual queries.

Models Tested: 8
Best Score: 62.5
Average Score: 49.0
Scale Range: 0–100
Weight: 1x

How It Works

Models are asked simple factual questions (e.g., "Who directed Inception?"), and their answers are verified against ground truth. The score is the percentage of correct, non-hallucinated responses.
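The scoring described above can be sketched as follows. This is a minimal illustration, assuming a simple normalized exact-match grader; the real benchmark uses more robust answer matching, and the question data and helper names here are hypothetical:

```python
def normalize(text: str) -> str:
    # Lowercase and strip punctuation so trivial formatting
    # differences are not counted as wrong answers.
    return "".join(
        ch for ch in text.lower() if ch.isalnum() or ch.isspace()
    ).strip()

def simpleqa_score(model_answers: list[str], ground_truth: list[str]) -> float:
    # Percentage of answers matching the verified ground truth.
    correct = sum(
        normalize(ans) == normalize(truth)
        for ans, truth in zip(model_answers, ground_truth)
    )
    return 100.0 * correct / len(ground_truth)

# Illustrative question set: one hallucinated answer out of four.
answers = ["Christopher Nolan", "Paris", "Mount Everest", "1969"]
truth = ["Christopher Nolan", "Paris", "K2", "1969"]
print(simpleqa_score(answers, truth))  # 3 of 4 correct -> 75.0
```

In practice the grader must also handle paraphrases and partial matches, which is why production evaluations often use a model-based judge rather than string comparison.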

Why It Matters

Hallucination is one of the biggest practical problems with LLMs. SimpleQA directly measures this by testing whether models can reliably provide accurate information on questions with unambiguous answers.

Limitations

SimpleQA tests only factual recall, not reasoning. Its questions are relatively simple, so models may hallucinate more on complex topics. It is English-centric, and it does not test the model's ability to say "I don't know".

Leaderboard — SimpleQA

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-4.5 | OpenAI | 62.5 |
| 🥈 | GPT-5.2 | OpenAI | 58.0 |
| 🥉 | GPT-5 | OpenAI | 52.0 |
| 4 | o3 | OpenAI | 49.0 |
| 5 | Gemini 2.5 Pro Preview 06-05 | Google | 47.0 |
| 6 | Claude Opus 4 | Anthropic | 44.0 |
| 7 | Claude Sonnet 4 | Anthropic | 41.0 |
| 8 | GPT-4o | OpenAI | 38.2 |
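The summary statistics above follow directly from the leaderboard; a quick check of the best and average scores:

```python
# Scores from the SimpleQA leaderboard, best to worst.
scores = [62.5, 58.0, 52.0, 49.0, 47.0, 44.0, 41.0, 38.2]

best = max(scores)
average = round(sum(scores) / len(scores), 1)
print(best, average)  # 62.5 49.0
```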