Floor Capability - meta benchmark leaderboard

#	Model	Floor score	DSWE	Evidence %
1	Claude Fable 5 Anthropic	84	pending	74%
2	Claude Opus 4.8 Anthropic	80	58% · $12.58 · 43m · 136k	82%
3	GPT-5.5 xhigh OpenAI	78	70% · $6.61 · 21m · 47k	70%
4	Gemini 3.1 Pro Google	77	pending	76%
5	Qwen3.7 Max Alibaba	74	pending	62%
6	Gemini 3.5 Flash Google	73	28% · $7.42 · 17m · 189k	68%
7	MiniMax-M3 MiniMax	71	20% · $5.57 · 57m · 98k	64%
8	Kimi K2.6 Moonshot AI	70	24% · $3.16 · 56m · 84k	64%
9	GLM-5.1 Reasoning Z.ai	67	18% · $7.46 · 35m · 49k	58%
10	DeepSeek V4 Pro DeepSeek	66	8% · $4.22 · 37m · 50k	54%

#	Model	Ops score	DSWE	Evidence %
1	GPT-5.5 xhigh OpenAI	86	70% · $6.61 · 21m · 47k	70%
2	Gemini 3.5 Flash Google	78	28% · $7.42 · 17m · 189k	68%
3	Kimi K2.6 Moonshot AI	77	24% · $3.16 · 56m · 84k	64%
4	DeepSeek V4 Pro DeepSeek	72	8% · $4.22 · 37m · 50k	54%
5	MiniMax-M3 MiniMax	70	20% · $5.57 · 57m · 98k	64%
6	GLM-5.1 Reasoning Z.ai	68	18% · $7.46 · 35m · 49k	58%
7	Gemini 3.1 Pro Google	67	pending	76%
8	Qwen3.7 Max Alibaba	62	pending	62%
9	Claude Opus 4.8 Anthropic	60	58% · $12.58 · 43m · 136k	82%
10	Claude Fable 5 Anthropic	58	pending	74%

#	Model	Frontier score	DSWE	Evidence %
1	Claude Fable 5 Anthropic	92	pending	74%
2	Claude Opus 4.8 Anthropic	88	58% · $12.58 · 43m · 136k	82%
3	GPT-5.5 xhigh OpenAI	87	70% · $6.61 · 21m · 47k	70%
4	Gemini 3.1 Pro Google	86	pending	76%
5	Qwen3.7 Max Alibaba	81	pending	62%
6	Gemini 3.5 Flash Google	79	28% · $7.42 · 17m · 189k	68%
7	MiniMax-M3 MiniMax	77	20% · $5.57 · 57m · 98k	64%
8	Kimi K2.6 Moonshot AI	76	24% · $3.16 · 56m · 84k	64%
9	GLM-5.1 Reasoning Z.ai	72	18% · $7.46 · 35m · 49k	58%
10	DeepSeek V4 Pro DeepSeek	71	8% · $4.22 · 37m · 50k	54%

#	Model	Research score	DSWE	Evidence %
1	Claude Fable 5 Anthropic	82	pending	74%
2	Claude Opus 4.8 Anthropic	82	58% · $12.58 · 43m · 136k	82%
3	Gemini 3.1 Pro Google	76	pending	76%
4	GPT-5.5 xhigh OpenAI	75	70% · $6.61 · 21m · 47k	70%
5	Qwen3.7 Max Alibaba	70	pending	62%
6	Gemini 3.5 Flash Google	68	28% · $7.42 · 17m · 189k	68%
7	MiniMax-M3 MiniMax	67	20% · $5.57 · 57m · 98k	64%
8	Kimi K2.6 Moonshot AI	66	24% · $3.16 · 56m · 84k	64%
9	GLM-5.1 Reasoning Z.ai	64	18% · $7.46 · 35m · 49k	58%
10	DeepSeek V4 Pro DeepSeek	62	8% · $4.22 · 37m · 50k	54%

Score

Saturation on a 0-100 scale. 100 means the lane is complete for that model; a saturated lane gets retired or hardened.

Evidence %

Source coverage behind the score, not model quality.

DSWE

Agentic coding result from DeepSWE, shown as score · cost · time · output tokens. "pending" means no result is listed yet.
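For readers who want to work with the DSWE column programmatically, here is a minimal sketch of a parser for one cell. It assumes the four fields are, in order, a score percentage, a cost in US dollars, a duration in minutes (`m`), and output tokens in thousands (`k`); the leaderboard itself only names cost, time, and output tokens, so the reading of the leading percentage and of the unit suffixes is an assumption. `DsweCell` and `parse_dswe` are hypothetical names, not part of any hub API.

```python
import re
from typing import NamedTuple, Optional


class DsweCell(NamedTuple):
    """One parsed DSWE cell (field meanings are assumptions, see lead-in)."""
    score_pct: float      # leading percentage, assumed to be the DSWE score
    cost_usd: float       # value after "$"
    minutes: int          # value before "m", assumed to be minutes
    output_tokens: int    # value before "k", assumed to be thousands of tokens


# Matches cells such as "58% · $12.58 · 43m · 136k".
_CELL = re.compile(
    r"^(\d+(?:\.\d+)?)%\s*·\s*\$(\d+(?:\.\d+)?)\s*·\s*(\d+)m\s*·\s*(\d+)k$"
)


def parse_dswe(cell: str) -> Optional[DsweCell]:
    """Parse a DSWE cell; return None for rows marked 'pending'."""
    cell = cell.strip()
    if cell == "pending":
        return None
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized DSWE cell: {cell!r}")
    pct, cost, minutes, kilo_tokens = match.groups()
    return DsweCell(float(pct), float(cost), int(minutes), int(kilo_tokens) * 1000)


if __name__ == "__main__":
    print(parse_dswe("58% · $12.58 · 43m · 136k"))
    print(parse_dswe("pending"))
```

Returning `None` for pending rows keeps missing results distinct from a real score of 0%, which matters if the parsed values are later sorted or averaged.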

Confidence

Evidence strength, shown under the ? beside each score. It is never another model grade.

10 ranked models - sorted by Floor Capability

Today on the hub

97 ranked public LLM rows

37 open-weight rows

32 rows with speed data

Value leader: Mistral Nemo