AI Reasoning Models

Models ranked by reasoning, math, and logic benchmark performance. Reasoning models use extended "thinking" to solve complex multi-step problems.

Dedicated Reasoning Models

Models specifically designed for extended reasoning and chain-of-thought problem solving.

#1

GPT-5.2 Pro

OpenAI

93.0/100 $21.00
Reasoning avg: 94.8

Premium reasoning tier. Speed data hidden until it is refreshed from a current live measurement source.

#2

O4 Mini

OpenAI

90.0/100 $1.10
Reasoning avg: 90.1

Fast reasoning; excels at math/code

#3

Claude Opus 4.6

Anthropic

89.0/100 $15.00
Reasoning avg: 54.3

Most capable; 1M context beta; adaptive thinking. Anthropic flagship model.

#4

O3

OpenAI

88.0/100 $2.00
Reasoning avg: 89.2

Reasoning model; 80% price cut from launch

#5

O3 Pro

OpenAI

88.0/100 $20.00
Reasoning avg: 77.2

Highest reasoning quality. Speed data hidden until it is refreshed from a current live measurement source.

#6

Qwen3 235B A22B

Alibaba

87.0/100 $0.46
Reasoning avg: 85.0

Thinking mode: $0.65/$3.00

#7

Claude Sonnet 4.6

Anthropic

86.0/100 $3.00
Reasoning avg: 83.0

Default model; extended thinking. Anthropic balanced frontier model.

#8

DeepSeek R1

DeepSeek

85.0/100 $0.70
Reasoning avg: 69.3

Open-weight reasoning; CoT tokens billed as output

#9

o1

OpenAI

84.0/100 $15.00
Reasoning avg: 87.2

First-gen reasoning model; chain-of-thought. Speed data hidden until it is refreshed from a current live measurement source.

#10

o3 Mini

OpenAI

84.0/100 $1.10
Reasoning avg: 84.5

Cost-efficient reasoning

#11

Gemini 2.5 Pro

Google

83.0/100 $1.25
Reasoning avg: 81.8

Thinking model with 1M context

#12

R1 0528

DeepSeek

83.0/100 $0.50
Reasoning avg: 87.1

Updated R1 with improved reasoning. Speed data hidden until it is refreshed from a current live measurement source.

#13

Gemini 2.5 Flash

Google

78.0/100 $0.30
Reasoning avg: 70.8

With thinking: $3.50 output

#14

DeepSeek V3.2

DeepSeek

77.0/100 $0.20
Reasoning avg: 73.0

685B params (37B active) MoE; 90% off cache hits

#15

DeepSeek V3

DeepSeek

76.0/100 $0.23
Reasoning avg: 68.7

Original V3; 671B params (37B active). Speed data hidden until it is refreshed from a current live measurement source.

#16

Phi 4

Microsoft

74.0/100 $0.07
Reasoning avg: 68.3

14B params; excels at math reasoning; MIT license. Speed data hidden until it is refreshed from a current live measurement source.

Reasoning Benchmark Rankings

All models ranked by average score across reasoning/math benchmarks (AIME 2025, ARC Challenge, GPQA Diamond, Humanity's Last Exam, LiveBench, MATH-500).

#   Model                   Tags            Provider   Reasoning Avg
1   GPT-5.2 Pro             Reasoning       OpenAI     94.8
2   GPT-5 Pro               -               OpenAI     92.8
3   O4 Mini                 Reasoning       OpenAI     90.1
4   O3                      Reasoning       OpenAI     89.2
5   o1                      Reasoning       OpenAI     87.2
6   R1 0528                 Reasoning, OSS  DeepSeek   87.1
7   GPT-5                   -               OpenAI     86.0
8   Qwen3 235B A22B         Reasoning, OSS  Alibaba    85.0
9   o3 Mini                 Reasoning       OpenAI     84.5
10  Claude Sonnet 4.6       Reasoning       Anthropic  83.0
11  Qwen3 Max               OSS             Alibaba    82.8
12  Gemini 2.5 Pro          Reasoning       Google     81.8
13  Claude Opus 4           -               Anthropic  80.4
14  Claude Opus 4.5         -               Anthropic  78.0
15  O3 Pro                  Reasoning       OpenAI     77.2
16  Llama 4 Scout           OSS             Meta       76.8
17  Claude Sonnet 4         -               Anthropic  76.5
18  GPT-4.1                 -               OpenAI     74.7
19  DeepSeek V3.2           Reasoning, OSS  DeepSeek   73.0
20  Claude Sonnet 4.5       -               Anthropic  72.0
21  Gemini 2.5 Flash        Reasoning       Google     70.8
22  GPT-4o-mini             -               OpenAI     70.2
23  DeepSeek R1             Reasoning, OSS  DeepSeek   69.3
24  Claude 3.5 Haiku        -               Anthropic  69.3
25  Llama 4 Maverick        OSS             Meta       68.8
26  DeepSeek V3             Reasoning, OSS  DeepSeek   68.7
27  Phi 4                   Reasoning, OSS  Microsoft  68.3
28  GPT-5.2                 -               OpenAI     67.0
29  GPT-4o (2024-05-13)     -               OpenAI     65.1
30  Qwen2.5 72B Instruct    OSS             Alibaba    64.5
31  Mistral Large           OSS             Mistral    58.0
32  Claude Opus 4.6         Reasoning       Anthropic  54.3
33  Llama 3.3 70B Instruct  OSS             Meta       50.7
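
The "Reasoning Avg" column is described as an average across the six listed benchmarks. The page does not say how that average is weighted or how missing results are handled, so the following is only a minimal sketch under two assumptions: an unweighted mean, with benchmarks that have no score skipped. The function name `reasoning_avg` and the example scores are hypothetical placeholders, not values from the leaderboard.

```python
from statistics import mean

# Benchmark names as listed in the ranking description.
BENCHMARKS = (
    "AIME 2025",
    "ARC Challenge",
    "GPQA Diamond",
    "Humanity's Last Exam",
    "LiveBench",
    "MATH-500",
)


def reasoning_avg(scores):
    """Unweighted mean of the available benchmark scores (0-100 scale).

    Assumption: benchmarks with no score (missing key or None) are skipped
    rather than counted as zero. The leaderboard's real rule is not stated.
    """
    available = [scores[name] for name in BENCHMARKS if scores.get(name) is not None]
    if not available:
        raise ValueError("no benchmark scores available")
    return round(mean(available), 1)


# Placeholder scores for illustration only (not real results).
example_scores = {
    "AIME 2025": 90.0,
    "GPQA Diamond": 80.0,
    "MATH-500": 97.0,
    "LiveBench": None,  # treated as missing
}

print(reasoning_avg(example_scores))
```

Skipping missing benchmarks means a model evaluated on only a few easy benchmarks can score higher than one evaluated on all six, which is one possible reason an average like this can diverge from an overall score.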

What are reasoning models?

Reasoning models (like OpenAI's o-series and DeepSeek R1) use extended "chain-of-thought" processing to work through complex problems step by step. They're particularly strong at:

  • Mathematics: Competition-level math problems (AIME, MATH-500)
  • Science: Graduate-level science questions (GPQA Diamond)
  • Coding: Complex software engineering tasks (SWE-Bench)
  • Logic: Multi-step logical deduction and constraint satisfaction

The trade-off is higher latency and cost: reasoning models "think" before responding, which takes longer and generates extra tokens but tends to produce more accurate answers on hard problems.
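
To make the cost side of that trade-off concrete, here is a rough sketch of how hidden reasoning tokens can inflate a bill. It assumes reasoning tokens are billed at the output rate (as the DeepSeek R1 entry notes for CoT tokens) and that the Qwen3 "Thinking mode: $0.65/$3.00" figures are USD per 1M input and output tokens, which the page does not state explicitly. The function name `request_cost_usd` and the token counts are hypothetical.

```python
def request_cost_usd(input_tokens, visible_output_tokens, reasoning_tokens,
                     input_price_per_m, output_price_per_m):
    """Estimate the cost of one request in USD.

    Assumptions: prices are per 1M tokens, and reasoning ("thinking")
    tokens are billed at the same rate as visible output tokens.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Hypothetical request: 2,000 prompt tokens and a 500-token visible answer.
INPUT_PRICE, OUTPUT_PRICE = 0.65, 3.00  # assumed USD per 1M tokens

without_thinking = request_cost_usd(2_000, 500, 0, INPUT_PRICE, OUTPUT_PRICE)
with_thinking = request_cost_usd(2_000, 500, 8_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"without thinking: ${without_thinking:.4f}")
print(f"with 8,000 reasoning tokens: ${with_thinking:.4f}")
```

Under these assumptions the visible answer is identical in both cases, yet the request with reasoning tokens costs several times more, because output tokens dominate the bill.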