Meta-benchmark / everyday reliability

A model should not look reliable just because it is brilliant at hard maths or coding. This view tracks the basic failures normal users still hit: wrong facts, unsupported claims, missed retrieval, forgotten context, and ignored instructions.

Factual accuracy (Critical, 25% weight): Does the model answer verifiable questions accurately instead of confidently guessing?
Missing benchmark data stays visible. It is not treated as zero, and it is not filled with generic quality scores.
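A tiny illustration (hypothetical numbers, not from the table) of why missing scores stay marked as missing rather than zero-filled: zero-filling silently drags the average down and makes sparse evidence look like bad performance.

```python
# One benchmark reported, two dimensions with no comparable evidence yet.
scores = [58.0, None, None]

# Naive zero-fill: missing evidence is punished as if it were a score of 0.
zero_filled = sum(s or 0.0 for s in scores) / len(scores)  # ≈ 19.33

# Evidence-only average: compute over reported dimensions, keep the gaps visible.
present = [s for s in scores if s is not None]
evidence_only = sum(present) / len(present)  # 58.0
```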
| Model | Floor | Evidence | Facts | Grounding | Retrieval | Instructions | Memory | Updated |
|---|---|---|---|---|---|---|---|---|
| GPT-5.2 · OpenAI · active | Not rankable (58.0 weighted avg) | Partial evidence (1/5 dimensions) | 58.0 | — | — | — | — | 17 May 2026 |
| GPT-5 · OpenAI · active | Not rankable (52.0 weighted avg) | Partial evidence (1/5 dimensions) | 52.0 | — | — | — | — | 17 May 2026 |
| Claude Opus 4 · Anthropic · active | Not rankable (44.0 weighted avg) | Partial evidence (1/5 dimensions) | 44.0 | — | — | — | — | 17 May 2026 |
| Claude Sonnet 4 · Anthropic · active | Not rankable (41.0 weighted avg) | Partial evidence (1/5 dimensions) | 41.0 | — | — | — | — | 17 May 2026 |
| o3 · OpenAI · active | Not rankable (49.0 weighted avg) | Partial evidence (1/5 dimensions) | 49.0 | — | — | — | — | 17 May 2026 |
| Gemini 2.5 Pro · Google · active | Not rankable (47.0 weighted avg) | Partial evidence (1/5 dimensions) | 47.0 | — | — | — | — | 17 May 2026 |
| GPT-4.5 · OpenAI · tracking | Not rankable (62.5 weighted avg) | Partial evidence (1/5 dimensions) | 62.5 | — | — | — | — | 17 May 2026 |
| GPT-4o (2024-05-13) · OpenAI · active | Not rankable (38.2 weighted avg) | Partial evidence (1/5 dimensions) | 38.2 | — | — | — | — | 17 May 2026 |
| Claude Opus 4.7 (Fast) · Anthropic · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| GPT Chat Latest · OpenAI · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Grok 4.3 · xAI · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Mistral Medium 3.5 · Mistral · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Qwen3.5 Plus 2026-04-20 · Alibaba · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Qwen3.6 27B · Alibaba · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Qwen3.6 35B A3B · Alibaba · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Qwen3.6 Flash · Alibaba · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Qwen3.6 Max Preview · Alibaba · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| DeepSeek V4 Flash · DeepSeek · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| DeepSeek V4 Pro · DeepSeek · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| GPT-5.5 · OpenAI · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Gemma 4 · Google · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Qwen 3.6 Plus · Alibaba · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Grok 4.20 · xAI · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Kimi K2.5 · Moonshot AI · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| MiniMax M2.7 · MiniMax · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| GPT-5.4 · OpenAI · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Gemini 3.1 Pro · Google · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Claude Sonnet 4.6 · Anthropic · active | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| GLM-5 · Zhipu AI · tracking | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| Claude Opus 4.6 · Anthropic · active | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| DeepSeek V3.2 · DeepSeek · active | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |
| DeepSeek R1 · DeepSeek · active | Not rankable (no floor score yet) | Tracking (0/5 dimensions) | — | — | — | — | — | Awaiting evidence |

— = no comparable evidence yet in that dimension.
Scoring shape
The headline score is a floor, not a trophy cabinet. Weak critical basics cap the final number.
- Factual accuracy (Critical): Does the model answer verifiable questions accurately instead of confidently guessing?
- Groundedness (Critical): Does the model stay faithful to supplied sources and avoid unsupported claims?
- Retrieval (Critical): Can the model find and use the right information inside long or noisy context?
- Instruction following (Critical): Does the model follow exact user constraints, formats, exclusions, and ordering?
- Memory and abstention (Supporting): Can the model preserve relevant user information and admit when evidence is missing?
Prompt and harness context
Benchmark outcomes measure model plus prompt plus harness plus judge. V1 records that context instead of pretending prompt robustness is already a clean headline metric.
| Source | Dimension | Prompt context | Harness | Judge |
|---|---|---|---|---|
| SimpleQA (official) | Factual accuracy | Published benchmark default; short fact-seeking prompts. | Simple factual QA harness with correct, incorrect, and not-attempted grading. | Benchmark-specific verifier and grading process. |
| FACTS Grounding (benchmark maintainer) | Groundedness | Published benchmark default; long-form answers grounded in supplied documents. | Eligibility check followed by factual-grounding judgement. | Multiple frontier LLM judges in the published setup. |
| IFEval (research) | Instruction following | Zero-shot prompts with verifiable instruction constraints. | Rule-based checks for instruction constraints such as required words, counts, and format. | Deterministic validators where possible. |
| RULER (research) | Retrieval | Synthetic long-context retrieval and tracing prompts. | Configurable long-context evaluation tasks beyond simple needle-in-haystack. | Task-specific exact or structured scoring. |
| HELMET (research) | Retrieval | Long-context task prompts across RAG, citations, summarization, and reranking. | Holistic long-context benchmark suite. | Task-specific scoring; varies by subtask. |
| CRAG (research) | Groundedness | Factual QA with web and knowledge-graph retrieval simulation. | RAG benchmark where retrieval setup affects the final result. | Benchmark-specific factual QA scoring. |
| LongMemEval (research) | Memory and abstention | Multi-session assistant memory questions over long user histories. | Memory benchmark covering extraction, temporal reasoning, updates, and abstention. | Benchmark-specific answer matching and evaluation. |
| LiveBench instruction tasks (benchmark maintainer) | Instruction following | Published LiveBench task setup; use only relevant instruction-following categories for this dimension. | Frequently refreshed objective benchmark tasks. | Objective ground-truth scoring where available. |
| HELM reliability scenarios (research) | Factual accuracy | Scenario-specific HELM prompts; only reliability-relevant scenarios should be mapped into this floor. | Holistic Evaluation of Language Models harness with explicit scenarios, metrics, and adapters. | Scenario-specific automated metrics and human or model-graded evaluations where applicable. |
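The source-to-dimension mapping in the table is the kind of thing worth pinning down in one machine-checkable place. A hypothetical sketch (the dictionary name and the snake_case dimension keys are illustrative, not a real config format):

```python
# Each tracked source feeds exactly one floor dimension; retrieval and
# groundedness each draw on two sources, per the table above.
SOURCE_DIMENSIONS = {
    "SimpleQA": "factual_accuracy",
    "FACTS Grounding": "groundedness",
    "IFEval": "instruction_following",
    "RULER": "retrieval",
    "HELMET": "retrieval",
    "CRAG": "groundedness",
    "LongMemEval": "memory_and_abstention",
    "LiveBench instruction tasks": "instruction_following",
    "HELM reliability scenarios": "factual_accuracy",
}
```

Note that for LiveBench and HELM only a filtered subset of tasks or scenarios maps into the floor, so an ingest step would still need per-source filters on top of this lookup.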
Watchlist
These remain visible because absence of evidence is useful information in a live reference hub.