LiveBench

Category: reasoning

LiveBench is a contamination-free benchmark that sources fresh questions monthly from recent mathematical competitions, coding contests, scientific papers, and news articles that post-date model training cutoffs.


Models Tested: 8
Best Score: 88.0
Average Score: 83.3
Scale Range: 0–100
Weight: 1.3x
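The Weight figure suggests LiveBench counts 1.3x in some site-wide aggregate score. As a hypothetical sketch of how such a weighted mean could be computed (the other benchmark name and its weight below are invented for illustration; only the 1.3 and the 88.0 come from the stats above):

```python
# Hypothetical weighted aggregate across benchmarks. Only LiveBench's
# weight (1.3) and best score (88.0) come from this page; "OtherBench"
# and its values are invented placeholders.

def weighted_mean(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted arithmetic mean of per-benchmark scores."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight

scores = {"LiveBench": 88.0, "OtherBench": 90.0}
weights = {"LiveBench": 1.3, "OtherBench": 1.0}
print(round(weighted_mean(scores, weights), 2))  # LiveBench pulls the mean toward 88.0
```

A higher weight simply means the benchmark's score contributes proportionally more to the numerator and denominator of the mean.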

How It Works

New questions are generated each month from recently published sources. Models answer questions across 6 categories: math, coding, reasoning, language, data analysis, and instruction following. All answers are verified objectively — no LLM-as-judge.
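As a minimal sketch of what objective, non-LLM verification can look like, assuming simple normalize-then-exact-match grading against a ground-truth answer key (the questions and answers below are invented examples, not actual LiveBench items or its real grading code):

```python
# Illustrative objective grading: deterministic comparison against a key,
# no judge model involved. All example data is invented.

def normalize(answer: str) -> str:
    """Lowercase, strip surrounding whitespace, drop trailing periods."""
    return answer.strip().lower().rstrip(".")

def grade(prediction: str, ground_truth: str) -> bool:
    """Objective check: exact match after normalization."""
    return normalize(prediction) == normalize(ground_truth)

def score(predictions: list[str], key: list[str]) -> float:
    """Fraction of matched answers, reported on the 0-100 scale."""
    correct = sum(grade(p, k) for p, k in zip(predictions, key))
    return 100.0 * correct / len(key)

preds = ["42", " Paris. ", "blue"]
key = ["42", "paris", "red"]
print(score(preds, key))  # 2 of 3 answers match
```

Real graders handle more formats (numeric tolerance, multiple-choice letters, code execution against test cases), but the essential property is the same: the verdict is reproducible and requires no model in the loop.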

Why It Matters

Data contamination is one of the biggest problems in AI evaluation: models may have memorised benchmark answers during training. LiveBench mitigates this by continuously introducing new questions, giving a more honest picture of true capability.

Limitations

Monthly updates make historical comparisons complex. Question difficulty may vary between months. Newer models may still have indirect exposure through similar (not identical) training examples.

Leaderboard — LiveBench

Rank  Model                         Provider   Score
1     GPT-5.2                       OpenAI     88.0
2     Claude Opus 4.6               Anthropic  86.5
3     o3                            OpenAI     85.0
4     Grok 4                        xAI        84.0
5     Claude Sonnet 4.6             Anthropic  83.0
6     Gemini 2.5 Pro Preview 06-05  Google     82.0
7     R1                            DeepSeek   80.0
8     Qwen3 235B A22B               Alibaba    78.0