Factuality and hallucination
SimpleQA
OpenAI publication + open benchmark. Short-form factuality benchmark for fact-seeking questions with single verifiable answers. Useful for measuring whether a model answers accurately instead of confidently guessing.
Strong fit for the Reliability Floor factuality dimension, but too narrow to stand alone as a general reliability score.
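The difference between answering accurately and confidently guessing shows up when abstentions are scored separately from wrong answers. A minimal sketch, assuming each answer has already been graded as correct, incorrect, or not attempted; the function name and label strings are illustrative and are not taken from the SimpleQA code.

```python
from collections import Counter

def summarize_grades(grades):
    """Summarize per-question grades into headline factuality numbers.

    `grades` is a list of strings, each one of "correct", "incorrect",
    or "not_attempted" (hypothetical label names for this sketch).
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly.
        "overall_correct": counts["correct"] / total if total else 0.0,
        # Accuracy restricted to questions the model chose to answer.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
        # Share of questions where the model declined to answer.
        "not_attempted_rate": counts["not_attempted"] / total if total else 0.0,
    }
```

A model that guesses on everything can raise overall_correct while lowering correct_given_attempted, which is why reporting both is more informative than a single accuracy figure.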
Grounded long-form factuality
FACTS Grounding
Google DeepMind benchmark + Kaggle leaderboard. Benchmark for whether long-form answers stay faithful to supplied documents while still addressing the user request.
Good fit for the Reliability Floor groundedness dimension because it tests source-grounded answers rather than broad world knowledge alone.
Instruction following
IFEval
Paper + Google Research code/data. Verifiable instruction-following benchmark covering constraints such as word counts, required keywords, and formatting rules.
Useful because it avoids purely subjective judging for a core reliability question: did the model do exactly what it was told?
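Constraints of this kind can be checked with plain code rather than a judge model. A minimal sketch, assuming three illustrative checkers (word limit, required keywords, bullet count); the function names and exact rules are hypothetical and are not the IFEval implementation.

```python
import re

def check_max_words(response, limit):
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit

def check_keywords(response, keywords):
    """True if every required keyword appears (case-insensitive)."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)

def check_bullet_count(response, expected):
    """True if the response has exactly `expected` lines starting with '- ' or '* '."""
    bullets = re.findall(r"^[ \t]*[-*] ", response, flags=re.MULTILINE)
    return len(bullets) == expected

def follows_all(response, checks):
    """Strict pass: every check must hold for the response to count."""
    return all(check(response) for check in checks)
```

For example, follows_all(text, [lambda r: check_max_words(r, 50), lambda r: check_bullet_count(r, 3)]) returns a single pass or fail for a prompt that imposes both constraints.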
Long-context retrieval
RULER
Paper + open-source code. Long-context benchmark that extends needle-in-haystack into harder retrieval, multi-hop tracing, aggregation, and question-answering tasks.
Use as a retrieval stress test, not as proof of real-world long-context reliability by itself.
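For orientation, the basic needle-in-haystack setup that RULER builds on can be sketched in a few lines. This is an assumption-laden illustration of the simple baseline only, not RULER's task generator; the function names, filler text, and substring scoring rule are all hypothetical.

```python
import random

def build_haystack(needle, filler_sentences, n_sentences, depth, seed=0):
    """Build a long context with `needle` placed at a relative depth.

    depth = 0.0 puts the needle first, depth = 1.0 puts it last.
    Returns the context string and the sentence index of the needle.
    """
    rng = random.Random(seed)  # seeded so the same context can be rebuilt
    body = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    position = round(depth * n_sentences)
    body.insert(position, needle)
    return " ".join(body), position

def retrieved(answer, expected_value):
    """Exact-substring check that the model's answer contains the value."""
    return expected_value in answer
```

Sweeping depth and n_sentences gives a grid of retrieval results by context length and needle position, which is the part that stays a stress test rather than evidence about real long-context tasks.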
Real long-context tasks
HELMET
Hugging Face benchmark + paper. Holistic long-context evaluation suite with more application-like tasks including retrieval-augmented generation, citations, summarization, and reranking.
Useful counterweight to simple needle tests because the authors explicitly warn that performance on simple synthetic retrieval tasks is a poor predictor of real downstream performance.
Comprehensive RAG benchmark for dynamic factual question answering with web and knowledge-graph search simulations.
Best treated as a RAG-system reliability source, because scores depend on retrieval setup as well as the base model.
Long-term memory
LongMemEval
OpenReview paper. Benchmark for long-term assistant memory across sustained interactions, including extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
Useful for the Reliability Floor memory dimension, but memory claims need especially careful evidence labels and clearly stated limitations.
Fresh objective tasks
LiveBench
Leaderboard + paper + GitHub. Frequently updated objective benchmark covering math, coding, reasoning, language, instruction following, and data analysis.
Useful for its regularly refreshed questions and objective grading; only the relevant categories should contribute to the Reliability Floor.
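Restricting the contribution to relevant categories can be as simple as averaging a named subset. A minimal sketch, assuming per-category scores are available as a dictionary; the category keys and the choice of which ones count as relevant are assumptions for illustration, not something LiveBench prescribes.

```python
def reliability_subset_score(category_scores, relevant):
    """Average only the categories selected as reliability-relevant.

    `category_scores` maps category name -> score; `relevant` lists the
    categories to include. Both are hypothetical inputs for this sketch.
    """
    if not relevant:
        raise ValueError("at least one relevant category is required")
    missing = [c for c in relevant if c not in category_scores]
    if missing:
        # Fail loudly rather than silently averaging over fewer categories.
        raise KeyError(f"missing category scores: {missing}")
    return sum(category_scores[c] for c in relevant) / len(relevant)
```

Raising on a missing category keeps the subset score from quietly changing meaning when the benchmark's category list is updated.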
Holistic model evaluation
HELM
Stanford CRFM benchmark + framework. Evaluation framework and benchmark collection designed to make model comparisons more transparent across scenarios, metrics, prompts, and adapters.
Useful as a provenance model for the Reliability Floor because it treats prompt, harness, scenario, and metric choices as part of the result rather than invisible background detail.
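Treating setup choices as part of the result can be made concrete by storing them alongside every score. A minimal sketch, assuming a hypothetical record type; the field names are illustrative and do not reflect HELM's actual schema.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class EvalRecord:
    # All field names are assumptions made for this sketch.
    model: str
    scenario: str
    metric: str
    prompt_template: str
    harness: str
    harness_version: str
    score: float

    def config_fingerprint(self):
        """Hash the evaluation setup, excluding the model and the score."""
        config = asdict(self)
        config.pop("score")
        config.pop("model")
        payload = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def comparable(a, b):
    """Two results are comparable only if their setups match exactly."""
    return a.config_fingerprint() == b.config_fingerprint()
```

With this shape, a leaderboard can refuse to rank two scores side by side when comparable(a, b) is false, for example when the prompt template differs.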
Agent evals and degradation tracking
MarginLab
Public site + docs + GitHub. Open benchmark and eval ecosystem focused on robust, reproducible agent testing, with public degradation trackers and historical views for coding agents.
Useful when you want repeated agent and harness measurement rather than a single static benchmark snapshot.
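Repeated measurement pays off when each new run is compared against recent history. A minimal sketch of one possible degradation check, assuming a chronological list of pass rates for a single agent and harness; the window size, threshold, and function name are arbitrary illustrative choices and do not describe how MarginLab computes its trackers.

```python
from statistics import mean

def detect_degradation(pass_rates, window=5, threshold=0.05):
    """Flag a drop of the latest run below the trailing baseline.

    `pass_rates` is ordered oldest to newest. The baseline is the mean of
    the `window` runs immediately before the latest one.
    """
    if len(pass_rates) < window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(pass_rates[-(window + 1):-1])
    return baseline - pass_rates[-1] > threshold
```

A single low run may be noise, so in practice a check like this would likely be paired with repeated runs per snapshot before calling something a regression.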