SimpleQA
SimpleQA tests factual accuracy on straightforward questions that should have clear, verifiable answers. It measures how often models hallucinate or confabulate when answering simple factual queries.
- Models Tested: 8
- Best Score: 62.5
- Average Score: 49.0
- Scale Range: 0–100
- Weight: 1x
How It Works
Models are asked simple factual questions (e.g., "Who directed Inception?"), and their answers are verified against ground truth. The score is the percentage of correct, non-hallucinated responses.
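The grading step can be sketched as follows. This is a minimal, hypothetical illustration of scoring predictions against references using exact match after light normalization; the actual benchmark pipeline and its grading method are not specified here (real graders are typically more lenient, e.g. model-based judging), so treat the function names and matching rule as assumptions.

```python
# Hypothetical sketch of SimpleQA-style scoring: compare model answers to
# ground-truth references after light normalization. Exact match is a
# simplification; real benchmarks often use a grader model instead.
def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'Christopher Nolan.' == 'christopher nolan'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def score(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of correct (non-hallucinated) answers on a 0-100 scale."""
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# One correct answer out of two yields a score of 50.0.
print(score(["Christopher Nolan", "Ridley Scott"],
            ["Christopher Nolan", "Denis Villeneuve"]))
```

A stricter variant would also track refusals ("I don't know") separately from wrong answers, since a refusal is not a hallucination.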
Why It Matters
Hallucination is one of the biggest practical problems with LLMs. SimpleQA directly measures this by testing whether models can reliably provide accurate information on questions with unambiguous answers.
Limitations
Only tests factual recall, not reasoning. Questions are relatively simple, so models may hallucinate more on complex topics. English-centric. Does not test the model's ability to say "I don't know".