LiveBench
LiveBench is a contamination-free benchmark that sources fresh questions monthly from recent mathematical competitions, coding contests, scientific papers, and news articles that post-date model training cutoffs.
- Models tested: 8
- Best score: 88.0
- Average score: 83.3
- Scale range: 0–100
- Weight: 1.3×
How It Works
New questions are generated each month from recently published sources. Models answer questions across 6 categories: math, coding, reasoning, language, data analysis, and instruction following. All answers are verified objectively — no LLM-as-judge.
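The "verified objectively" step can be sketched as a simple ground-truth comparison. This is an illustrative sketch only, not LiveBench's actual scorer (its grading rules vary by task); the function names and the exact-match-after-normalisation rule are assumptions.

```python
# Illustrative sketch of judge-free grading: compare a model's answer to a
# known ground truth after light normalisation. No LLM is in the loop.
def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(answer.lower().split())

def grade(model_answer: str, ground_truth: str) -> int:
    """Return 1 if the normalised answer matches the ground truth, else 0."""
    return int(normalize(model_answer) == normalize(ground_truth))

grade("  The Answer is 42 ", "the answer is 42")  # matches after normalisation
```

Because the rule is deterministic, two runs of the grader on the same answers always produce the same score, which is what makes the benchmark's results reproducible.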
Why It Matters
Data contamination is one of the biggest problems in AI evaluation — models may have memorised benchmark answers during training. LiveBench solves this by continuously using new questions, giving a more honest picture of true capability.
Limitations
Monthly updates make historical comparisons complex. Question difficulty may vary between months. Newer models may still have indirect exposure through similar (not identical) training examples.
Leaderboard — LiveBench
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 88.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 86.5 |
| 🥉 | o3 | OpenAI | 85.0 |
| 4 | Grok 4 | xAI | 84.0 |
| 5 | Claude Sonnet 4.6 | Anthropic | 83.0 |
| 6 | Gemini 2.5 Pro Preview 06-05 | Google | 82.0 |
| 7 | R1 | DeepSeek | 80.0 |
| 8 | Qwen3 235B A22B | Alibaba | 78.0 |
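The summary statistics quoted at the top can be recomputed directly from the eight leaderboard rows: the best score is the maximum, and the mean of the eight scores rounds to 83.3.

```python
# Recompute the headline stats from the leaderboard scores above.
scores = [88.0, 86.5, 85.0, 84.0, 83.0, 82.0, 80.0, 78.0]

best = max(scores)                             # 88.0
average = round(sum(scores) / len(scores), 1)  # 666.5 / 8 = 83.3125 -> 83.3
```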