Factuality and hallucination
SimpleQA
OpenAI publication + open benchmark
Short-form factuality benchmark for fact-seeking questions with single verifiable answers. Useful for measuring whether a model answers accurately instead of confidently guessing.
Strong fit for the Reliability Floor factuality dimension, but too narrow to stand alone as a general reliability score.
- Targets hallucination on short factual questions
- Includes correct, incorrect, and not-attempted grading
- Useful for calibration and abstention analysis
- OpenAI reports the benchmark was designed to remain challenging for frontier models
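The three-way grading above lends itself to a small summary routine. The following is a minimal sketch, assuming each question has already been labeled with one of the three grades; the function name, the label strings, and the harmonic-mean combined score are illustrative assumptions, not the benchmark's official tooling.

```python
from collections import Counter

# Hypothetical label strings for the three grades named above.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Summarize per-question grades into accuracy and abstention rates."""
    if not grades:
        raise ValueError("grades must not be empty")
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grades: {sorted(unknown)}")

    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    # Share of all questions answered correctly.
    overall_correct = counts["correct"] / total
    # Accuracy restricted to questions the model chose to answer.
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0

    # Assumption: combine the two rates with a harmonic mean so that neither
    # blind guessing nor blanket abstention scores well.
    denom = overall_correct + correct_given_attempted
    combined = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "not_attempted_rate": counts["not_attempted"] / total,
        "combined": combined,
    }
```

Reporting the attempted-only accuracy next to the abstention rate makes it visible when a model raises its score by declining hard questions rather than by knowing more.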
Grounded long-form factuality
FACTS Grounding
Google DeepMind benchmark + Kaggle leaderboard
Benchmark for whether long-form answers stay faithful to supplied documents while still addressing the user request.
Good fit for the Reliability Floor groundedness dimension because it tests source-grounded answers rather than broad world knowledge alone.
- Long-form grounded response evaluation
- Public and private evaluation sets
- Documents span finance, technology, retail, medicine, and law
- Uses separate eligibility and factual-grounding judgement phases
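The two judgement phases can be combined into a single pass rate. This is a minimal sketch, assuming each response already carries one boolean per phase; the `Judgement` class and its field names are hypothetical, and how the judgements are produced is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Hypothetical per-response record holding one verdict per phase."""
    eligible: bool  # phase 1: does the response address the user request?
    grounded: bool  # phase 2: is the response supported by the supplied document?


def grounded_pass_rate(judgements):
    """Fraction of responses that pass both phases."""
    if not judgements:
        raise ValueError("judgements must not be empty")
    # A response that dodges the request fails even if everything it says
    # is supported by the document.
    passed = sum(1 for j in judgements if j.eligible and j.grounded)
    return passed / len(judgements)
```

Requiring both phases closes the loophole where a model stays perfectly grounded by saying almost nothing.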
Instruction following
IFEval
Paper + Google Research code/data
Verifiable instruction-following benchmark covering constraints such as word counts, required keywords, and formatting rules.
Useful because it avoids purely subjective judging for a core reliability question: did the model do exactly what it was told?
- Around 500 prompts
- 25 verifiable instruction types
- Objective checks for many prompt constraints
- Good first source for instruction-following sub-scores
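Constraints of this kind can be checked with plain code instead of a judge model. The sketch below shows three illustrative checkers and a per-response roll-up; the function names and exact rules are assumptions for illustration, not the benchmark's own implementation.

```python
import re


def check_max_words(response, limit):
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_keywords(response, keywords):
    """True if every required keyword appears, ignoring case."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def check_bullet_count(response, expected):
    """True if the response has exactly `expected` lines starting with '-' or '*'."""
    bullets = re.findall(r"^\s*[-*]\s+", response, flags=re.MULTILINE)
    return len(bullets) == expected


def evaluate(response, checks):
    """Run named checks; return per-check results and whether all passed."""
    results = {name: check(response) for name, check in checks.items()}
    return results, all(results.values())


# Example: one response evaluated against three constraints.
response = "- alpha reliability\n- beta floor"
checks = {
    "max_words": lambda r: check_max_words(r, 10),
    "keywords": lambda r: check_keywords(r, ["reliability", "floor"]),
    "bullets": lambda r: check_bullet_count(r, 2),
}
results, all_passed = evaluate(response, checks)
```

Keeping the per-check results alongside the all-passed flag supports both a strict per-prompt score and a more forgiving per-instruction sub-score.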
Long-context retrieval
RULER
Paper + open-source code
Long-context benchmark that extends needle-in-haystack into harder retrieval, multi-hop tracing, aggregation, and question-answering tasks.
Use as a retrieval stress test, not as proof of real-world long-context reliability by itself.
- 13 representative long-context tasks
- Tests behavior beyond simple literal retrieval
- Configurable sequence length and task complexity
- Exposes failures as context length increases
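For orientation, the basic needle-in-haystack setup that this benchmark builds on can be sketched in a few lines. This is not the benchmark's own generator: the function names are hypothetical, and word count stands in for token count as a simplifying assumption.

```python
def build_haystack(needle, filler, total_words, depth):
    """Insert `needle` into repeated filler text at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")

    # Repeat the filler until it reaches the requested context length.
    words = [filler_words[i % len(filler_words)] for i in range(total_words)]
    # Convert the relative depth into a word position.
    position = int(depth * total_words)
    return " ".join(words[:position] + needle.split() + words[position:])


def contains_answer(response, expected):
    """Literal-match scoring: did the response reproduce the expected value?"""
    return expected.lower() in response.lower()


# Sweep context length and needle depth to see where retrieval starts to fail.
prompts = [
    build_haystack("The access code is 7421.", "lorem ipsum dolor", length, depth)
    for length in (100, 1000)
    for depth in (0.0, 0.5, 1.0)
]
```

A sweep like this only covers literal retrieval; the harder task types listed above need their own generators and scoring.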
Real long-context tasks
HELMET
Hugging Face benchmark + paper
Holistic long-context evaluation suite with more application-like tasks including retrieval-augmented generation, citations, summarization, and reranking.
Useful counterweight to simple needle tests because the authors explicitly warn that simple synthetic retrieval does not predict real downstream performance well.
- Evaluates diverse long-context applications
- Controls input length and task complexity
- Includes RAG, citation, summarization, and reranking-style tasks
- Shows categories do not always correlate with each other
CRAG
Comprehensive RAG benchmark for dynamic factual question answering with web and knowledge-graph search simulations.
Best treated as a RAG-system reliability source, because scores depend on retrieval setup as well as the base model.
- 4,409 factual QA pairs
- Five domains and eight question categories
- Covers popularity and temporal dynamism
- Measures hallucination pressure in retrieval-augmented answers
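One way to make hallucination pressure visible in a single number is to score a wrong answer below an abstention. The sketch below is an illustrative scoring rule with assumed label names and weights; it is not presented as the benchmark's official metric.

```python
def penalized_score(labels, reward=1.0, penalty=1.0):
    """Average score where hallucinated answers cost more than missing ones.

    Assumed labels: 'accurate', 'missing' (abstained or no answer),
    and 'hallucinated' (answered, but wrong).
    """
    if not labels:
        raise ValueError("labels must not be empty")
    weights = {"accurate": reward, "missing": 0.0, "hallucinated": -penalty}
    unknown = set(labels) - set(weights)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return sum(weights[label] for label in labels) / len(labels)
```

Under a rule like this, a system that answers every question can score lower than one that abstains when retrieval comes back empty, which matters when scores depend on the retrieval setup as well as the base model.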
Long-term memory
LongMemEval
OpenReview paper
Benchmark for long-term assistant memory across sustained interactions, including extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
Useful for the Reliability Floor memory dimension, but memory claims need especially careful evidence labels and limitations.
- 500 curated questions
- Scalable user-assistant chat histories
- Tests five core long-term memory abilities
- Separates memory design choices across indexing, retrieval, and reading stages
Fresh objective tasks
LiveBench
Leaderboard + paper + GitHub
Frequently updated objective benchmark covering math, coding, reasoning, language, instruction following, and data analysis.
Useful for fresh-data discipline and objective grading; only the relevant categories should contribute to the Reliability Floor.
- Frequently updated questions
- Objective ground-truth scoring
- Includes instruction following and data analysis categories
- Designed to reduce contamination risk
Holistic model evaluation
HELM
Stanford CRFM benchmark + framework
Stanford CRFM evaluation framework and benchmark collection designed to make model comparisons more transparent across scenarios, metrics, prompts, and adapters.
Useful as a provenance model for the Reliability Floor because it treats prompt, harness, scenario, and metric choices as part of the result rather than invisible background detail.
- Explicit scenario and metric structure
- Records model, adapter, prompt, and metric details
- Covers multiple capability and risk dimensions
- Good reference model for transparent eval reporting
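The idea of treating prompt, adapter, scenario, and metric as part of the result can be sketched as a provenance record. The class and field names below are hypothetical and are not taken from the HELM codebase.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRecord:
    """Hypothetical record that stores a score together with how it was obtained."""
    model: str
    adapter: str
    prompt_template: str
    scenario: str
    metric: str
    value: float

    def config_id(self):
        """Short stable hash of everything except the score itself."""
        config = asdict(self)
        config.pop("value")
        blob = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


def comparable(a, b):
    """Two scores are directly comparable only if everything but the model matches."""
    return (a.adapter, a.prompt_template, a.scenario, a.metric) == (
        b.adapter, b.prompt_template, b.scenario, b.metric
    )
```

Two runs with the same `config_id` are repeats of one configuration, and a changed prompt template yields a new identifier, so a silent harness change cannot pass as a model regression.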
Agent evals and degradation tracking
MarginLab
Public site + docs + GitHub
Open benchmark and eval ecosystem focused on robust, reproducible agent testing, with public degradation trackers and historical views for coding agents.
Useful when you want repeated agent and harness measurement rather than a single static benchmark snapshot.
- Degradation Tracker for agent performance drift
- Historical performance views over time
- Benchmark explorers including SWE-Bench Pro and Terminal-Bench 2.0
- Open-source eval runtime that tracks accuracy, tokens, duration, and traces
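Repeated measurement makes simple drift checks possible. The sketch below compares a recent window of pass rates against an earlier baseline window; it is an illustrative approach with hypothetical names and thresholds, not a description of how MarginLab's tracker works.

```python
from statistics import mean


def detect_degradation(pass_rates, baseline_n, recent_n, tolerance):
    """Flag a drop when the recent mean falls below the baseline mean by more than `tolerance`.

    `pass_rates` is a chronological list of per-run pass rates in [0, 1].
    """
    if baseline_n <= 0 or recent_n <= 0:
        raise ValueError("window sizes must be positive")
    if len(pass_rates) < baseline_n + recent_n:
        raise ValueError("not enough runs for non-overlapping windows")

    baseline = mean(pass_rates[:baseline_n])  # earliest runs
    recent = mean(pass_rates[-recent_n:])     # latest runs
    drop = baseline - recent
    return {
        "baseline": baseline,
        "recent": recent,
        "drop": drop,
        "degraded": drop > tolerance,
    }
```

Averaging over windows rather than comparing single runs reduces false alarms from run-to-run noise, which tends to be large for agent evals.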