AI Reasoning Models
Models ranked by reasoning, math, and logic benchmark performance. Reasoning models use extended "thinking" to solve complex multi-step problems.
Dedicated Reasoning Models
Models specifically designed for extended reasoning and chain-of-thought problem solving.
GPT-5.2 Pro
OpenAI
Premium reasoning tier. Speed data is hidden until it is refreshed from a current live measurement source.
o4 Mini
OpenAI
Fast reasoning; excels at math/code
Claude Opus 4.6
Anthropic
Most capable; 1M context beta; adaptive thinking. Anthropic flagship model.
o3
OpenAI
Reasoning model; 80% price cut from launch
o3 Pro
OpenAI
Highest reasoning quality
Qwen3 235B A22B
Alibaba
Thinking mode: $0.65/$3.00
Claude Sonnet 4.6
Anthropic
Default model; extended thinking. Anthropic balanced frontier model.
DeepSeek R1
DeepSeek
Open-weight reasoning; CoT tokens billed as output
o1
OpenAI
First-gen reasoning model; chain-of-thought
o3 Mini
OpenAI
Cost-efficient reasoning
Gemini 2.5 Pro
Google
Thinking model with 1M context
R1 0528
DeepSeek
Updated R1 with improved reasoning
Gemini 2.5 Flash
Google
With thinking: $3.50 output
DeepSeek V3.2
DeepSeek
685B params (37B active) MoE; 90% off cache hits
DeepSeek V3
DeepSeek
Original V3; 671B params (37B active)
Phi 4
Microsoft
14B params; excels at math reasoning; MIT license
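Several entries above mention pricing modifiers, such as the "90% off cache hits" note on DeepSeek V3.2. The arithmetic can be sketched as below. This is a minimal illustration, assuming the discount applies to input tokens served from a prompt cache and that prices are quoted per 1M tokens; the function name, the token counts, and the $1.00 price are hypothetical placeholders, not figures from this page.

```python
def input_cost(cached_tokens, uncached_tokens, price_per_m, cache_discount=0.90):
    """Estimate input cost in dollars when cached tokens are discounted.

    Assumption: price_per_m is the list price per 1M input tokens, and
    cache_discount is the fraction taken off for cache hits (0.90 = 90% off).
    """
    cached_price = price_per_m * (1.0 - cache_discount)
    total = cached_tokens * cached_price + uncached_tokens * price_per_m
    return total / 1_000_000


# Hypothetical request: 8,000 cached and 2,000 uncached input tokens
# at a placeholder price of $1.00 per 1M tokens.
with_cache = input_cost(8_000, 2_000, 1.00)
without_cache = input_cost(0, 10_000, 1.00)
print(f"with cache: ${with_cache:.4f}, without cache: ${without_cache:.4f}")
```

Under these placeholder numbers the cached request costs a little over a quarter of the uncached one, which is why cache discounts matter most for workloads that resend a long shared prefix.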
Reasoning Benchmark Rankings
All models ranked by average score across reasoning/math benchmarks (AIME 2025, ARC Challenge, GPQA Diamond, Humanity's Last Exam, LiveBench, MATH-500).
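A ranking by average benchmark score can be sketched as follows. This is an illustration only: the page does not say how missing benchmark results are handled, so the sketch assumes a model is averaged over the benchmarks it has scores for. The model names and scores are hypothetical placeholders, not real results.

```python
from statistics import mean

# Benchmark names as listed on this page.
BENCHMARKS = [
    "AIME 2025", "ARC Challenge", "GPQA Diamond",
    "Humanity's Last Exam", "LiveBench", "MATH-500",
]


def rank_by_average(scores):
    """Return (model, average_score) pairs sorted best-first.

    scores maps a model name to {benchmark: score in percent}.
    Assumption: benchmarks without a score are skipped rather than
    counted as zero; models with no scores at all are left out.
    """
    averages = {}
    for model, results in scores.items():
        known = [results[b] for b in BENCHMARKS if b in results]
        if known:
            averages[model] = mean(known)
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder data for illustration only.
example_scores = {
    "model-a": {"AIME 2025": 80.0, "GPQA Diamond": 70.0, "MATH-500": 90.0},
    "model-b": {"AIME 2025": 60.0, "GPQA Diamond": 80.0},
    "model-c": {},
}
ranking = rank_by_average(example_scores)
for position, (model, avg) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {avg:.1f}")
```

Skipping missing benchmarks favors models that were only tested on easier suites, so a real ranking may need a minimum-coverage rule or per-benchmark normalization.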
What are reasoning models?
Reasoning models (like OpenAI's o-series and DeepSeek R1) use extended "chain-of-thought" processing to work through complex problems step by step. They're particularly strong at:
- Mathematics: Competition-level math problems (AIME, MATH-500)
- Science: Graduate-level science questions (GPQA Diamond)
- Coding: Complex software engineering tasks (SWE-Bench)
- Logic: Multi-step logical deduction and constraint satisfaction
The trade-off is higher latency and cost: reasoning models "think" before responding, which takes longer and typically consumes more tokens, but tends to produce more accurate answers on hard problems.
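The cost side of this trade-off can be made concrete with a rough estimate. The sketch below assumes that thinking tokens are billed at the output rate (as the DeepSeek R1 entry on this page notes for its chain-of-thought tokens) and that prices are quoted per 1M tokens. The function name, token counts, and prices are hypothetical placeholders.

```python
def request_cost(input_tokens, answer_tokens, thinking_tokens,
                 input_price_per_m, output_price_per_m):
    """Estimate the dollar cost of one request.

    Assumption: thinking (chain-of-thought) tokens are billed exactly
    like visible answer tokens, at the output price per 1M tokens.
    """
    billed_output = answer_tokens + thinking_tokens
    total = (input_tokens * input_price_per_m
             + billed_output * output_price_per_m)
    return total / 1_000_000


# Same hypothetical prompt and answer, with and without 8,000 thinking
# tokens, at placeholder prices of $1.00 input / $4.00 output per 1M.
plain = request_cost(2_000, 500, 0, 1.00, 4.00)
reasoning = request_cost(2_000, 500, 8_000, 1.00, 4.00)
print(f"plain: ${plain:.4f}, with thinking: ${reasoning:.4f}")
```

With these placeholder numbers the thinking tokens dominate the bill, so the visible answer length alone is a poor guide to what a reasoning request costs.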