Arena-Hard

Category: conversational

Arena-Hard is an automated benchmark that uses GPT-4-Turbo as a judge to evaluate model responses on 500 challenging user queries drawn from Chatbot Arena. It approximates human preference rankings at a fraction of the cost of live human voting.

Models Tested: 7
Best Score: 92.0
Average Score: 87.6
Scale Range: 0–100
Weight: 1.2x

How It Works

Models generate responses to 500 difficult prompts sourced from Chatbot Arena. GPT-4-Turbo then judges each response head-to-head against the response from a fixed baseline model, and win rates are estimated with the Bradley-Terry model, as sketched below.
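The sketch below illustrates this flow under simplifying assumptions: `judge()` is a hypothetical stand-in for a GPT-4-Turbo judging call (not the official Arena-Hard API), and the Bradley-Terry step uses the closed-form two-player maximum-likelihood estimate rather than the bootstrapped fit used by the actual benchmark.

```python
# Minimal sketch of the Arena-Hard scoring flow. `judge` is a hypothetical
# callable standing in for a GPT-4-Turbo judging request; all names here are
# illustrative, not the official Arena-Hard implementation.

def collect_verdicts(prompts, model_answers, baseline_answers, judge):
    """Run head-to-head comparisons of one model against the baseline."""
    wins = ties = losses = 0
    for prompt, answer, baseline in zip(prompts, model_answers, baseline_answers):
        verdict = judge(prompt, answer, baseline)  # returns "model", "baseline", or "tie"
        if verdict == "model":
            wins += 1
        elif verdict == "tie":
            ties += 1
        else:
            losses += 1
    return wins, ties, losses

def bradley_terry_win_rate(wins, ties, losses):
    """Win probability vs. the baseline under a two-player Bradley-Terry model.

    With the baseline's strength fixed at 1.0 and ties counted as half a win
    for each side, the maximum-likelihood strength has a closed form.
    """
    w = wins + 0.5 * ties
    l = losses + 0.5 * ties
    strength = w / max(l, 1e-9)          # MLE of the model's relative strength
    return strength / (strength + 1.0)   # expected win rate against the baseline
```

Leaderboard scores of this kind are typically read as the estimated win rate (in percent) against the fixed baseline model.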

Why It Matters

Arena-Hard provides a fast, reproducible proxy for Chatbot Arena rankings without requiring thousands of human votes. It correlates strongly (>0.9) with actual Arena Elo ratings, making it valuable for rapid model evaluation.
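As a rough illustration of how such agreement can be checked, the snippet below computes a Spearman rank correlation between benchmark scores and Arena Elo ratings. The scores are taken from the leaderboard below; the Elo values are made-up placeholders, not published figures.

```python
# Illustrative only: the Elo values below are hypothetical placeholders,
# not real Chatbot Arena data.
from scipy.stats import spearmanr

arena_hard_scores = [92.0, 90.0, 89.0, 88.5, 86.0, 85.5, 82.0]  # leaderboard scores below
arena_elo_ratings = [1360, 1350, 1340, 1335, 1320, 1315, 1290]  # hypothetical Elo ratings

rho, p_value = spearmanr(arena_hard_scores, arena_elo_ratings)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3g})")
```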

Limitations

Relies on GPT-4-Turbo as the judge, which can exhibit biases such as favoring longer or more verbose answers and responses stylistically similar to its own. It cannot capture the full diversity of human preferences and may not reliably evaluate models that are stronger than the judge itself.

Leaderboard — Arena-Hard

Rank  Model                          Provider   Score
🥇    GPT-5.2                        OpenAI     92.0
🥈    Claude Opus 4.6                Anthropic  90.0
🥉    Grok 4                         xAI        89.0
4     o3                             OpenAI     88.5
5     Claude Sonnet 4.6              Anthropic  86.0
6     Gemini 2.5 Pro Preview 06-05   Google     85.5
7     R1                             DeepSeek   82.0