SimpleQA

Category: safety

SimpleQA tests factual accuracy on straightforward questions that should have clear, verifiable answers. It measures how often models hallucinate or confabulate when answering simple factual queries.

Models Tested: 8
Best Score: 62.5
Average Score: 49.0
Scale Range: 0–100
Weight: 1x

How It Works

Models are asked simple factual questions (e.g., "Who directed Inception?"), and their answers are verified against ground truth. The score is the percentage of correct, non-hallucinated responses.
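The scoring described above can be sketched as follows. This is a minimal illustration, assuming a simple normalized exact-match grader; the real benchmark uses more robust answer matching, and the question data and helper names here are hypothetical:

```python
def normalize(text: str) -> str:
    # Lowercase and strip punctuation so trivial formatting
    # differences are not counted as wrong answers.
    return "".join(
        ch for ch in text.lower() if ch.isalnum() or ch.isspace()
    ).strip()

def simpleqa_score(model_answers: list[str], ground_truth: list[str]) -> float:
    # Percentage of answers matching the verified ground truth.
    correct = sum(
        normalize(ans) == normalize(truth)
        for ans, truth in zip(model_answers, ground_truth)
    )
    return 100.0 * correct / len(ground_truth)

# Illustrative question set: one hallucinated answer out of four.
answers = ["Christopher Nolan", "Paris", "Mount Everest", "1969"]
truth = ["Christopher Nolan", "Paris", "K2", "1969"]
print(simpleqa_score(answers, truth))  # 3 of 4 correct -> 75.0
```

In practice the grader must also handle paraphrases and partial matches, which is why production evaluations often use a model-based judge rather than string comparison.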

Why It Matters

Hallucination is one of the biggest practical problems with LLMs. SimpleQA directly measures this by testing whether models can reliably provide accurate information on questions with unambiguous answers.

Limitations

SimpleQA tests only factual recall, not reasoning. Its questions are relatively simple, so models may hallucinate more on complex topics. It is English-centric, and it does not test the model's ability to say "I don't know".

Leaderboard — SimpleQA

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-4.5 | OpenAI | 62.5 |
| 🥈 | GPT-5.2 | OpenAI | 58.0 |
| 🥉 | GPT-5 | OpenAI | 52.0 |
| 4 | o3 | OpenAI | 49.0 |
| 5 | Gemini 2.5 Pro Preview 06-05 | Google | 47.0 |
| 6 | Claude Opus 4 | Anthropic | 44.0 |
| 7 | Claude Sonnet 4 | Anthropic | 41.0 |
| 8 | GPT-4o | OpenAI | 38.2 |
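The summary statistics above follow directly from the leaderboard; a quick check of the best and average scores:

```python
# Scores from the SimpleQA leaderboard, best to worst.
scores = [62.5, 58.0, 52.0, 49.0, 47.0, 44.0, 41.0, 38.2]

best = max(scores)
average = round(sum(scores) / len(scores), 1)
print(best, average)  # 62.5 49.0
```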