AI Reasoning Models

Models ranked by reasoning, math, and logic benchmark performance. Reasoning models use extended "thinking" to solve complex multi-step problems.

Dedicated Reasoning Models

Models specifically designed for extended reasoning and chain-of-thought problem solving.

GPT-5.2 Pro

OpenAI

Quality: 93.0/100, Price: $21.00

Reasoning avg: 94.8

Premium reasoning tier. Speed data hidden until it is refreshed from a current live measurement source.

o4 Mini

OpenAI

Reasoning avg: 90.1

Fast reasoning; excels at math/code

Claude Opus 4.6

Anthropic

Quality: 89.0/100, Price: $15.00

Reasoning avg: 54.3

Most capable; 1M context beta; adaptive thinking. Anthropic flagship model.

o3

OpenAI

Reasoning avg: 89.2

Reasoning model; 80% price cut from launch

o3 Pro

OpenAI

Quality: 88.0/100, Price: $20.00

Reasoning avg: 77.2

Highest reasoning quality. Speed data hidden until it is refreshed from a current live measurement source.

Qwen3 235B A22B

Alibaba

Reasoning avg: 85.0

Thinking mode: $0.65/$3.00

Claude Sonnet 4.6

Anthropic

Reasoning avg: 83.0

Default model; extended thinking. Anthropic balanced frontier model.

DeepSeek R1

DeepSeek

Reasoning avg: 69.3

Open-weight reasoning; CoT tokens billed as output

o1

OpenAI

Quality: 84.0/100, Price: $15.00

Reasoning avg: 87.2

First-gen reasoning model; chain-of-thought. Speed data hidden until it is refreshed from a current live measurement source.

o3 Mini

OpenAI

Reasoning avg: 84.5

Cost-efficient reasoning

Gemini 2.5 Pro

Google

Reasoning avg: 81.8

Thinking model with 1M context

R1 0528

DeepSeek

Reasoning avg: 87.1

Updated R1 with improved reasoning. Speed data hidden until it is refreshed from a current live measurement source.

Gemini 2.5 Flash

Google

Reasoning avg: 70.8

With thinking: $3.50 output

DeepSeek V3.2

DeepSeek

Reasoning avg: 73.0

685B params (37B active) MoE; 90% off cache hits

DeepSeek V3

DeepSeek

Reasoning avg: 68.7

Original V3; 671B params (37B active). Speed data hidden until it is refreshed from a current live measurement source.

Phi 4

Microsoft

Reasoning avg: 68.3

14B params; excels at math reasoning; MIT license. Speed data hidden until it is refreshed from a current live measurement source.

Reasoning Benchmark Rankings

All models ranked by average score across reasoning/math benchmarks (AIME 2025, ARC Challenge, GPQA Diamond, Humanity's Last Exam, LiveBench, MATH-500).

#	Model	Provider	Tags	Reasoning Avg	AIME 2025	ARC Challenge	GPQA Diamond	Quality	Price
1	GPT-5.2 Pro	OpenAI	Reasoning	94.8	—	—	91.0	93.0	$21.00
2	GPT-5 Pro	OpenAI	none	92.8	—	—	88.0	90.0	$15.00
3	o4 Mini	OpenAI	Reasoning	90.1	92.7	—	81.4	90.0	$1.10
4	o3	OpenAI	Reasoning	89.2	91.6	—	83.3	88.0	$2.00
5	o1	OpenAI	Reasoning	87.2	—	—	78.0	84.0	$15.00
6	R1 0528	DeepSeek	Reasoning, OSS	87.1	87.5	—	76.0	83.0	$0.50
7	GPT-5	OpenAI	none	86.0	—	—	86.0	87.0	$1.25
8	Qwen3 235B A22B	Alibaba	Reasoning, OSS	85.0	—	—	—	87.0	$0.46
9	o3 Mini	OpenAI	Reasoning	84.5	—	—	75.0	84.0	$1.10
10	Claude Sonnet 4.6	Anthropic	Reasoning	83.0	—	—	—	86.0	$3.00
11	Qwen3 Max	Alibaba	OSS	82.8	—	—	72.0	85.0	$0.78
12	Gemini 2.5 Pro	Google	Reasoning	81.8	86.7	—	68.4	83.0	$1.25
13	Claude Opus 4	Anthropic	none	80.4	—	—	72.1	84.0	$15.00
14	Claude Opus 4.5	Anthropic	none	78.0	—	—	78.0	86.0	$5.00
15	o3 Pro	OpenAI	Reasoning	77.2	96.7	—	87.5	88.0	$20.00
16	Llama 4 Scout	Meta	OSS	76.8	—	—	—	79.0	$0.10
17	Claude Sonnet 4	Anthropic	none	76.5	—	—	67.5	79.0	$3.00
18	GPT-4.1	OpenAI	none	74.7	—	—	66.3	77.0	$2.00
19	DeepSeek V3.2	DeepSeek	Reasoning, OSS	73.0	—	—	62.0	77.0	$0.24
20	Claude Sonnet 4.5	Anthropic	none	72.0	—	—	72.0	79.0	$3.00
21	Gemini 2.5 Flash	Google	Reasoning	70.8	—	—	59.2	78.0	$0.30
22	GPT-4o-mini (2024-07-18)	OpenAI	none	70.2	—	—	—	74.0	$0.15
23	DeepSeek R1	DeepSeek	Reasoning, OSS	69.3	79.8	—	71.5	85.0	$0.70
24	Llama 4 Maverick	Meta	OSS	68.8	—	—	56.0	76.0	$0.20
25	DeepSeek V3	DeepSeek	Reasoning, OSS	68.7	—	—	59.1	76.0	$0.20
26	Phi 4	Microsoft	Reasoning, OSS	68.3	—	—	56.1	74.0	$0.07
27	GPT-5.2	OpenAI	none	67.0	—	—	89.0	90.0	$1.75
28	GPT-4o (2024-05-13)	OpenAI	none	65.1	—	—	53.6	75.0	$5.00
29	Qwen2.5 72B Instruct	Alibaba	OSS	64.5	—	—	49.0	71.0	$0.36
30	Mistral Large	Mistral	OSS	58.0	—	—	58.0	73.0	$2.00
31	Claude Opus 4.6	Anthropic	Reasoning	54.3	—	—	—	89.0	$15.00
32	Llama 3.3 70B Instruct	Meta	OSS	50.7	—	—	50.7	71.0	$0.10
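
The page does not say how the reasoning average is aggregated. As a rough sketch, assuming it is an unweighted mean over whichever benchmarks have a published score (missing cells skipped), the ranking could be reproduced as below. The names `SCORES`, `reasoning_avg`, and `rank_models` are hypothetical. The three sample rows were chosen because GPQA Diamond is their only visible score and their listed average equals it, which is consistent with this assumption. Rows such as GPT-5.2 Pro cannot be checked this way, since their averages draw on benchmarks that are not shown as table columns.

```python
from statistics import mean
from typing import Optional

# Three rows copied from the table above; None stands for a cell with no published score.
# Only the three benchmark columns shown in the table are included.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "Llama 3.3 70B Instruct": {"AIME 2025": None, "ARC Challenge": None, "GPQA Diamond": 50.7},
    "GPT-5": {"AIME 2025": None, "ARC Challenge": None, "GPQA Diamond": 86.0},
    "Claude Opus 4.5": {"AIME 2025": None, "ARC Challenge": None, "GPQA Diamond": 78.0},
}


def reasoning_avg(row: dict[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean of the benchmarks that have a score (assumed aggregation rule)."""
    available = [score for score in row.values() if score is not None]
    return round(mean(available), 1) if available else None


def rank_models(scores: dict[str, dict[str, Optional[float]]]) -> list[tuple[str, float]]:
    """Sort models by reasoning average, highest first; models with no scores are dropped."""
    averaged = [(model, reasoning_avg(row)) for model, row in scores.items()]
    scored = [(model, avg) for model, avg in averaged if avg is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for position, (model, avg) in enumerate(rank_models(SCORES), start=1):
    print(position, model, avg)
```

Skipping missing cells rather than counting them as zero keeps a model from being penalized for benchmarks it was never run on, but it also means averages built from different benchmark subsets are not strictly comparable.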

What are reasoning models?

Reasoning models (like OpenAI's o-series and DeepSeek R1) use extended "chain-of-thought" processing to work through complex problems step by step. They're particularly strong at:

Mathematics: Competition-level math problems (AIME, MATH-500)
Science: Graduate-level science questions (GPQA Diamond)
Coding: Complex software engineering tasks (SWE-Bench)
Logic: Multi-step logical deduction and constraint satisfaction

The trade-off is higher latency and cost: reasoning models "think" before responding, which takes longer and consumes more tokens but tends to produce more accurate answers on hard problems.
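
To make the cost side of the trade-off concrete, here is a minimal sketch of per-request cost when hidden reasoning tokens are billed as output, as the DeepSeek R1 card notes. It assumes prices are quoted in USD per million tokens and reads the Qwen3 "thinking mode" figure of $0.65/$3.00 as input/output prices; the page states neither, so both are assumptions. The token counts and the function name `request_cost` are hypothetical.

```python
def request_cost(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost in USD of one request, with reasoning tokens billed at the output rate."""
    billed_output_tokens = visible_output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output_tokens * output_price_per_m
    return total / 1_000_000


# Assumed reading of "$0.65/$3.00": USD per 1M input tokens / per 1M output tokens.
INPUT_PRICE = 0.65
OUTPUT_PRICE = 3.00

# Hypothetical request: 2,000 prompt tokens and a 500-token visible answer.
without_thinking = request_cost(2_000, 500, 0, INPUT_PRICE, OUTPUT_PRICE)
# Same request with 4,000 hidden reasoning tokens generated before the answer.
with_thinking = request_cost(2_000, 500, 4_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"without thinking: ${without_thinking:.4f}")
print(f"with thinking:    ${with_thinking:.4f}")
```

Under these assumed numbers the visible answer is identical in both cases, yet the request with reasoning costs several times more, because the reasoning tokens dominate the billed output.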
