LiveBench
LiveBench is a contamination-free benchmark that sources fresh questions monthly from recent mathematical competitions, coding contests, scientific papers, and news articles that post-date model training cutoffs.
- Models tested: 8
- Best score: 88.0
- Average score: 83.3
- Scale range: 0–100
- Weight: 1.3×
How It Works
New questions are generated each month from recently published sources. Models answer questions across 6 categories: math, coding, reasoning, language, data analysis, and instruction following. All answers are verified objectively — no LLM-as-judge.
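The "verified objectively" step can be sketched as a simple ground-truth comparison. This is an illustrative sketch only, not LiveBench's actual scorer (its grading rules vary by task); the function names and the exact-match-after-normalisation rule are assumptions.

```python
# Illustrative sketch of judge-free grading: compare a model's answer to a
# known ground truth after light normalisation. No LLM is in the loop.
def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(answer.lower().split())

def grade(model_answer: str, ground_truth: str) -> int:
    """Return 1 if the normalised answer matches the ground truth, else 0."""
    return int(normalize(model_answer) == normalize(ground_truth))

grade("  The Answer is 42 ", "the answer is 42")  # matches after normalisation
```

Because the rule is deterministic, two runs of the grader on the same answers always produce the same score, which is what makes the benchmark's results reproducible.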
Why It Matters
Data contamination is one of the biggest problems in AI evaluation — models may have memorised benchmark answers during training. LiveBench solves this by continuously using new questions, giving a more honest picture of true capability.
Limitations
Monthly updates make historical comparisons complex. Question difficulty may vary between months. Newer models may still have indirect exposure through similar (not identical) training examples.
Leaderboard — LiveBench
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 88.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 86.5 |
| 🥉 | o3 | OpenAI | 85.0 |
| 4 | Grok 4 | xAI | 84.0 |
| 5 | Claude Sonnet 4.6 | Anthropic | 83.0 |
| 6 | Gemini 2.5 Pro Preview 06-05 | Google | 82.0 |
| 7 | R1 | DeepSeek | 80.0 |
| 8 | Qwen3 235B A22B | Alibaba | 78.0 |
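The summary statistics quoted at the top can be recomputed directly from the eight leaderboard rows: the best score is the maximum, and the mean of the eight scores rounds to 83.3.

```python
# Recompute the headline stats from the leaderboard scores above.
scores = [88.0, 86.5, 85.0, 84.0, 83.0, 82.0, 80.0, 78.0]

best = max(scores)                             # 88.0
average = round(sum(scores) / len(scores), 1)  # 666.5 / 8 = 83.3125 -> 83.3
```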