AI Reasoning Models

Models ranked by reasoning, math, and logic benchmark performance. Reasoning models use extended "thinking" to solve complex multi-step problems.

Dedicated Reasoning Models

Models specifically designed for extended reasoning and chain-of-thought problem solving.

#1

O4 Mini

OpenAI

90.0/100 $1.10
Reasoning avg: 90.1

Fast reasoning; excels at math/code

#2

Claude Opus 4.6

Anthropic

89.0/100 $15.00
Reasoning avg: 54.3

Most capable; 1M context beta; adaptive thinking. Anthropic flagship model.

#3

Grok 4

xAI

88.0/100 $3.00
Reasoning avg: 70.5

Frontier reasoning (matches o3 quality)

#4

O3

OpenAI

88.0/100 $2.00
Reasoning avg: 89.2

Reasoning model; 80% price cut from launch

#5

O3 Pro

OpenAI

88.0/100 $20.00
Reasoning avg: 77.2

Highest reasoning quality

#6

Qwen3 235B A22B

Alibaba

87.0/100 $0.46
Reasoning avg: 85.0

Thinking mode: $0.65/$3.00

#7

Claude Sonnet 4.6

Anthropic

86.0/100 $3.00
Reasoning avg: 83.0

Default model; extended thinking. Anthropic balanced frontier model.

#8

DeepSeek V3.2

DeepSeek

86.0/100 $0.20

685B params (37B active) MoE; 90% off cache hits

#9

DeepSeek R1

DeepSeek

85.0/100 $0.70
Reasoning avg: 69.3

Open-weight reasoning; CoT tokens billed as output

#10

Gemini 2.5 Pro

Google

83.0/100 $1.25
Reasoning avg: 81.8

Thinking model with 1M context

#11

Gemini 2.5 Flash

Google

78.0/100 $0.30
Reasoning avg: 70.8

With thinking: $3.50 output

#12

QwQ 32B

Alibaba

78.0/100 $0.15
Reasoning avg: 77.7

Reasoning model on Qwen2.5 base

#13

DeepSeek V3

DeepSeek

76.0/100 $0.32
Reasoning avg: 68.7

Original V3; 671B params (37B active)

Reasoning Benchmark Rankings

All models ranked by average score across reasoning/math benchmarks (AIME 2025, ARC Challenge, GPQA Diamond, Humanity's Last Exam, LiveBench, MATH-500).

#   Model                 Tags            Provider   Reasoning Avg
1   O4 Mini               Reasoning       OpenAI     90.1
2   O3                    Reasoning       OpenAI     89.2
3   GPT-5                 -               OpenAI     86.0
4   Qwen3 235B A22B       Reasoning, OSS  Alibaba    85.0
5   Claude Sonnet 4.6     Reasoning       Anthropic  83.0
6   Gemini 2.5 Pro        Reasoning       Google     81.8
7   Grok 3 Beta           -               xAI        81.2
8   Claude Opus 4         -               Anthropic  80.4
9   QwQ 32B               Reasoning, OSS  Alibaba    77.7
10  O3 Pro                Reasoning       OpenAI     77.2
11  Claude Sonnet 4       -               Anthropic  76.5
12  GPT-4.1               -               OpenAI     74.7
13  Gemini 2.5 Flash      Reasoning       Google     70.8
14  Grok 4                Reasoning       xAI        70.5
15  DeepSeek R1           Reasoning, OSS  DeepSeek   69.3
16  DeepSeek V3           Reasoning, OSS  DeepSeek   68.7
17  GPT-5.2               -               OpenAI     67.0
18  GPT-4o (extended)     -               OpenAI     65.1
19  Qwen2.5 72B Instruct  OSS             Alibaba    64.5
20  Llama 4 Maverick      OSS             Meta       56.0
21  Claude Opus 4.6       Reasoning       Anthropic  54.3
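
The page does not say exactly how the "Reasoning Avg" column is computed. The Python sketch below shows one plausible reading: an unweighted mean over whichever of the six listed benchmarks a model has a score for, followed by a descending sort. The function names, the rule of skipping missing benchmarks, the one-decimal rounding, and the per-benchmark scores for model_a and model_b are all hypothetical and are not taken from the leaderboard.

```python
from statistics import mean

# Benchmarks named in the rankings description.
BENCHMARKS = [
    "AIME 2025", "ARC Challenge", "GPQA Diamond",
    "Humanity's Last Exam", "LiveBench", "MATH-500",
]

def reasoning_average(scores: dict[str, float]) -> float:
    """Unweighted mean over the benchmarks present in `scores`.

    Assumption: benchmarks without a score are skipped, not counted as zero.
    """
    available = [scores[name] for name in BENCHMARKS if name in scores]
    if not available:
        raise ValueError("no reasoning benchmark scores available")
    return round(mean(available), 1)

def rank(models: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (model, average) pairs sorted from highest to lowest average."""
    averages = {name: reasoning_average(s) for name, s in models.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-benchmark scores, for illustration only.
example = {
    "model_b": {"AIME 2025": 70.0, "GPQA Diamond": 75.0,
                "LiveBench": 65.0, "MATH-500": 94.0},
    "model_a": {"AIME 2025": 90.0, "GPQA Diamond": 80.0, "MATH-500": 97.0},
}

for position, (name, avg) in enumerate(rank(example), start=1):
    print(position, name, avg)
```

If missing benchmarks are skipped as assumed here, a model scored on only a few easier benchmarks can outrank one scored on all six, so averages computed this way are only loosely comparable across models.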

What are reasoning models?

Reasoning models (like OpenAI's o-series and DeepSeek R1) use extended "chain-of-thought" processing to work through complex problems step by step. They're particularly strong at:

  • Mathematics: Competition-level math problems (AIME, MATH-500)
  • Science: Graduate-level science questions (GPQA Diamond)
  • Coding: Complex software engineering tasks (SWE-Bench)
  • Logic: Multi-step logical deduction and constraint satisfaction

The trade-off is higher latency and cost: reasoning models "think" before responding, which takes longer and consumes more tokens but tends to produce more accurate answers on hard problems.
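
To make the cost side of this trade-off concrete, here is a minimal Python sketch of how reasoning tokens can inflate a single request's bill when they are charged at the output rate, as the DeepSeek R1 entry notes. It assumes that the "$0.65/$3.00" thinking-mode figure listed for Qwen3 235B A22B means USD per million input and output tokens, which the page does not state. The function name `request_cost` and all token counts are hypothetical.

```python
def request_cost(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate the USD cost of one request.

    Assumption: reasoning (chain-of-thought) tokens are billed at the
    output rate, on top of the visible answer tokens.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000

# Hypothetical request: 2,000 prompt tokens, a 500-token answer,
# and 4,000 reasoning tokens produced before the answer.
with_thinking = request_cost(2000, 500, 4000, 0.65, 3.00)
without_thinking = request_cost(2000, 500, 0, 0.65, 3.00)

print(f"with reasoning tokens:    ${with_thinking:.4f}")
print(f"without reasoning tokens: ${without_thinking:.4f}")
```

Under these assumed numbers most of the bill comes from the reasoning tokens rather than the visible answer, which is why the listed per-token price alone understates what a reasoning model costs per request.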