Arena-Hard
Arena-Hard is an automated benchmark that uses GPT-4-Turbo as a judge to evaluate model responses on 500 challenging user queries from Chatbot Arena. It approximates human preferences at a fraction of the cost.
- Models Tested: 7
- Best Score: 92.0
- Average Score: 87.6
- Scale Range: 0–100
- Weight: 1.2×
How It Works
Models generate responses to 500 difficult prompts sourced from Chatbot Arena. GPT-4-Turbo then judges each response in head-to-head comparisons against a baseline model. Win rates are calculated using the Bradley-Terry model.
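The paper's full scoring pipeline adds details such as bootstrapped confidence intervals; the sketch below is only a minimal illustration of the Bradley-Terry step, fitting model strengths from judge verdicts with the standard minorization-maximization updates. Model names and verdict counts are hypothetical.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via
    minorization-maximization updates."""
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # total games per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    p = {m: 1.0 for m in models}  # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v * len(models) / total for m, v in new_p.items()}  # renormalize
    return p

# Toy usage: judge verdicts of a candidate model against a fixed baseline.
verdicts = [("candidate", "baseline")] * 70 + [("baseline", "candidate")] * 30
strengths = bradley_terry(verdicts)
win_rate = strengths["candidate"] / (strengths["candidate"] + strengths["baseline"])
print(f"estimated win rate vs. baseline: {win_rate:.2f}")  # ~0.70
```

With only two players and a single baseline, the fitted win rate reduces to the observed fraction of wins; the Bradley-Terry formulation matters once many models are compared against one another.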
Why It Matters
Arena-Hard provides a fast, reproducible proxy for Chatbot Arena rankings without requiring thousands of human votes. It correlates strongly (>0.9) with actual Arena Elo ratings, making it valuable for rapid model evaluation.
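As a rough illustration of how such a correlation claim can be checked, the sketch below computes a Spearman rank correlation between Arena-Hard scores and Chatbot Arena Elo ratings for the same set of models; the Elo values here are placeholders, not real leaderboard numbers.

```python
from scipy.stats import spearmanr

# Paired scores for the same models (Elo values are illustrative placeholders).
arena_hard_scores = [92.0, 90.0, 89.0, 88.5, 86.0, 85.5, 82.0]
chatbot_arena_elo = [1362, 1355, 1340, 1338, 1320, 1315, 1290]

rho, p_value = spearmanr(arena_hard_scores, chatbot_arena_elo)
print(f"Spearman rank correlation: {rho:.3f} (p = {p_value:.4f})")
```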
Limitations
Relies on GPT-4 as judge, which may have biases toward certain response styles. Cannot capture the full diversity of human preferences. May not accurately evaluate models that are stronger than the judge model.
Leaderboard — Arena-Hard
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 92.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 90.0 |
| 🥉 | Grok 4 | xAI | 89.0 |
| 4 | o3 | OpenAI | 88.5 |
| 5 | Claude Sonnet 4.6 | Anthropic | 86.0 |
| 6 | Gemini 2.5 Pro Preview 06-05 | Google | 85.5 |
| 7 | R1 | DeepSeek | 82.0 |