Meta-benchmark / everyday reliability

Reliability Floor

A model should not look reliable just because it is brilliant at hard maths or coding. This view tracks the basic failures everyday users still hit: wrong facts, unsupported claims, missed retrieval, forgotten context, and ignored instructions.

Rank eligible: 0 · Partial evidence: 8 · Tracked sources: 9 · Last evidence: 17 May 2026

Reliability Floor is intentionally strict. It uses a gated score: the weighted average is capped by the weakest critical dimension, and a model cannot be labelled reliable until factuality, groundedness, retrieval, and instruction following all have comparable evidence.

Current evidence

Everyday reliability by model

Missing benchmark data stays visible: it is neither treated as zero nor back-filled with generic quality scores.

Model | Vendor / status | Floor | Evidence | Facts | Grounding | Retrieval | Instructions | Memory | Updated
("–" = no comparable evidence yet)
GPT-5.2 | OpenAI / active | Not rankable (58.0 weighted avg) | Partial, 1/5 dimensions | 58.0 | – | – | – | – | 17 May 2026
GPT-5 | OpenAI / active | Not rankable (52.0 weighted avg) | Partial, 1/5 dimensions | 52.0 | – | – | – | – | 17 May 2026
Claude Opus 4 | Anthropic / active | Not rankable (44.0 weighted avg) | Partial, 1/5 dimensions | 44.0 | – | – | – | – | 17 May 2026
Claude Sonnet 4 | Anthropic / active | Not rankable (41.0 weighted avg) | Partial, 1/5 dimensions | 41.0 | – | – | – | – | 17 May 2026
O3 | OpenAI / active | Not rankable (49.0 weighted avg) | Partial, 1/5 dimensions | 49.0 | – | – | – | – | 17 May 2026
Gemini 2.5 Pro | Google / active | Not rankable (47.0 weighted avg) | Partial, 1/5 dimensions | 47.0 | – | – | – | – | 17 May 2026
GPT-4.5 | OpenAI / tracking | Not rankable (62.5 weighted avg) | Partial, 1/5 dimensions | 62.5 | – | – | – | – | 17 May 2026
GPT-4o (2024-05-13) | OpenAI / active | Not rankable (38.2 weighted avg) | Partial, 1/5 dimensions | 38.2 | – | – | – | – | 17 May 2026
Claude Opus 4.7 (Fast) | Anthropic / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
GPT Chat Latest | OpenAI / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Grok 4.3 | xAI / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Mistral Medium 3.5 | Mistral / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Qwen3.5 Plus 2026-04-20 | Alibaba / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Qwen3.6 27B | Alibaba / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Qwen3.6 35B A3B | Alibaba / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Qwen3.6 Flash | Alibaba / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Qwen3.6 Max Preview | Alibaba / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
DeepSeek V4 Flash | DeepSeek / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
DeepSeek V4 Pro | DeepSeek / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
GPT-5.5 | OpenAI / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Gemma 4 | Google / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Qwen 3.6 Plus | Alibaba / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Grok 4.20 | xAI / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Kimi K2.5 | Moonshot AI / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
MiniMax M2.7 | MiniMax / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
GPT-5.4 | OpenAI / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Gemini 3.1 Pro | Google / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Claude Sonnet 4.6 | Anthropic / active | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
GLM-5 | Zhipu AI / tracking | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
Claude Opus 4.6 | Anthropic / active | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
DeepSeek V3.2 | DeepSeek / active | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence
DeepSeek R1 | DeepSeek / active | Not rankable (no floor score yet) | Tracking, 0/5 dimensions | – | – | – | – | – | Awaiting evidence

Scoring shape

Dimensions and weights

The headline score is a floor, not a trophy cabinet. Weak critical basics cap the final number.

Tier | Dimension | Weight | What it measures
Critical | Factual accuracy | 25% | Does the model answer verifiable questions accurately instead of confidently guessing?
Critical | Groundedness | 20% | Does the model stay faithful to supplied sources and avoid unsupported claims?
Critical | Retrieval | 20% | Can the model find and use the right information inside long or noisy context?
Critical | Instruction following | 20% | Does the model follow exact user constraints, formats, exclusions, and ordering?
Supporting | Memory and abstention | 15% | Can the model preserve relevant user information and admit when evidence is missing?

Prompt and harness context

Tracked benchmark sources

A benchmark outcome measures the model together with its prompt, harness, and judge. V1 records that context instead of pretending prompt robustness is already a clean headline metric.

Source | Type | Dimension | Prompt context | Harness | Judge
SimpleQA | official | Factual accuracy | Published benchmark default; short fact-seeking prompts. | Simple factual QA harness with correct, incorrect, and not-attempted grading. | Benchmark-specific verifier and grading process.
FACTS Grounding | benchmark-maintainer | Groundedness | Published benchmark default; long-form answers grounded in supplied documents. | Eligibility check followed by factual-grounding judgement. | Multiple frontier LLM judges in the published setup.
IFEval | research | Instruction following | Zero-shot prompts with verifiable instruction constraints. | Rule-based checks for instruction constraints such as required words, counts, and format. | Deterministic validators where possible.
RULER | research | Retrieval | Synthetic long-context retrieval and tracing prompts. | Configurable long-context evaluation tasks beyond simple needle-in-haystack. | Task-specific exact or structured scoring.
HELMET | research | Retrieval | Long-context task prompts across RAG, citations, summarization, and reranking. | Holistic long-context benchmark suite. | Task-specific scoring; varies by subtask.
CRAG | research | Groundedness | Factual QA with web and knowledge-graph retrieval simulation. | RAG benchmark where retrieval setup affects the final result. | Benchmark-specific factual QA scoring.
LongMemEval | research | Memory and abstention | Multi-session assistant memory questions over long user histories. | Memory benchmark covering extraction, temporal reasoning, updates, and abstention. | Benchmark-specific answer matching and evaluation.
LiveBench instruction tasks | benchmark-maintainer | Instruction following | Published LiveBench task setup; only the relevant instruction-following categories feed this dimension. | Frequently refreshed objective benchmark tasks. | Objective ground-truth scoring where available.
HELM reliability scenarios | research | Factual accuracy | Scenario-specific HELM prompts; only reliability-relevant scenarios are mapped into this floor. | Holistic Evaluation of Language Models harness with explicit scenarios, metrics, and adapters. | Scenario-specific automated metrics and human or model-graded evaluations where applicable.
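The source-to-dimension mapping above can be made machine-checkable. The sketch below simply restates the table; the dictionary name and grouping are illustrative, and how multiple sources feeding one dimension are combined is not specified on this page.

```python
from collections import defaultdict

# Restates the "Tracked benchmark sources" table; names are illustrative.
SOURCE_DIMENSION = {
    "SimpleQA": "Factual accuracy",
    "FACTS Grounding": "Groundedness",
    "IFEval": "Instruction following",
    "RULER": "Retrieval",
    "HELMET": "Retrieval",
    "CRAG": "Groundedness",
    "LongMemEval": "Memory and abstention",
    "LiveBench instruction tasks": "Instruction following",
    "HELM reliability scenarios": "Factual accuracy",
}

# Group sources by the floor dimension they feed.
sources_by_dimension: dict[str, list[str]] = defaultdict(list)
for source, dimension in SOURCE_DIMENSION.items():
    sources_by_dimension[dimension].append(source)
```

Nine tracked sources cover all five dimensions: Factual accuracy, Groundedness, Retrieval, and Instruction following are each fed by two sources, Memory and abstention by one.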

Watchlist

Current models still waiting for comparable evidence

These remain visible because absence of evidence is useful information in a live reference hub.