Arena-Hard
Arena-Hard is an automated benchmark that uses GPT-4-Turbo as a judge to evaluate model responses on 500 challenging user queries from Chatbot Arena. It approximates human preferences at a fraction of the cost.
- Models Tested: 7
- Best Score: 92.0
- Average Score: 87.6
- Scale Range: 0–100
- Weight: 1.2×
How It Works
Models generate responses to 500 difficult prompts sourced from Chatbot Arena. GPT-4-Turbo then judges each response in head-to-head comparisons against a baseline model. Win rates are calculated using the Bradley-Terry model.
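The paper's full scoring pipeline adds details such as bootstrapped confidence intervals; the sketch below is only a minimal illustration of the Bradley-Terry step, fitting model strengths from judge verdicts with the standard minorization-maximization updates. Model names and verdict counts are hypothetical.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via
    minorization-maximization updates."""
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # total games per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    p = {m: 1.0 for m in models}  # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v * len(models) / total for m, v in new_p.items()}  # renormalize
    return p

# Toy usage: judge verdicts of a candidate model against a fixed baseline.
verdicts = [("candidate", "baseline")] * 70 + [("baseline", "candidate")] * 30
strengths = bradley_terry(verdicts)
win_rate = strengths["candidate"] / (strengths["candidate"] + strengths["baseline"])
print(f"estimated win rate vs. baseline: {win_rate:.2f}")  # ~0.70
```

With only two players and a single baseline, the fitted win rate reduces to the observed fraction of wins; the Bradley-Terry formulation matters once many models are compared against one another.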
Why It Matters
Arena-Hard provides a fast, reproducible proxy for Chatbot Arena rankings without requiring thousands of human votes. It correlates strongly (>0.9) with actual Arena Elo ratings, making it valuable for rapid model evaluation.
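As a rough illustration of how such a correlation claim can be checked, the sketch below computes a Spearman rank correlation between Arena-Hard scores and Chatbot Arena Elo ratings for the same set of models; the Elo values here are placeholders, not real leaderboard numbers.

```python
from scipy.stats import spearmanr

# Paired scores for the same models (Elo values are illustrative placeholders).
arena_hard_scores = [92.0, 90.0, 89.0, 88.5, 86.0, 85.5, 82.0]
chatbot_arena_elo = [1362, 1355, 1340, 1338, 1320, 1315, 1290]

rho, p_value = spearmanr(arena_hard_scores, chatbot_arena_elo)
print(f"Spearman rank correlation: {rho:.3f} (p = {p_value:.4f})")
```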
Limitations
Relies on GPT-4 as judge, which may have biases toward certain response styles. Cannot capture the full diversity of human preferences. May not accurately evaluate models that are stronger than the judge model.
Leaderboard — Arena-Hard
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 92.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 90.0 |
| 🥉 | Grok 4 | xAI | 89.0 |
| 4 | o3 | OpenAI | 88.5 |
| 5 | Claude Sonnet 4.6 | Anthropic | 86.0 |
| 6 | Gemini 2.5 Pro Preview 06-05 | Google | 85.5 |
| 7 | R1 | DeepSeek | 82.0 |