Chatbot Arena ELO
Chatbot Arena (by LMSYS) is a crowdsourced evaluation where real users have blind conversations with two anonymous models and vote for which response they prefer. Results are compiled into an ELO rating system.
View paper / source

- Models Tested: 27
- Best Score: 1375.0
- Average Score: 1306.0
- Scale Range: 800–1400
- Weight: 1.5x
How It Works
Users interact with two anonymous models side-by-side and pick the better response. Votes are aggregated using the Bradley-Terry model (similar to chess ELO) to produce a ranking. Over 2 million human votes have been collected.
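The aggregation step can be sketched in miniature. The real leaderboard fits a Bradley-Terry model over all votes jointly; the snippet below shows the closely related sequential Elo update the text alludes to, with an assumed base rating of 1000 and an assumed K-factor of 4 (both are illustrative choices, not Arena's actual parameters):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 4.0):
    """Update both ratings after one human vote.

    A small K-factor keeps ratings stable across millions of votes.
    """
    e_a = expected_score(r_a, r_b)       # expected outcome for A
    s_a = 1.0 if a_won else 0.0          # actual outcome for A
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two hypothetical models starting from an assumed base rating of 1000:
r_a, r_b = 1000.0, 1000.0
for a_won in [True, True, False, True]:  # four simulated votes
    r_a, r_b = elo_update(r_a, r_b, a_won)
```

Because each update is zero-sum, the total rating mass is conserved; the winner of a vote gains exactly what the loser gives up.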
Why It Matters
Chatbot Arena is widely considered the most reliable benchmark for overall model quality because it captures real human preferences across diverse, unconstrained conversations — not just narrow academic tasks.
Limitations
Ratings reflect crowd preferences, which may favour style over substance. The vote pool is biased toward English-language conversations, and the user base may not be representative of all use cases. Models can also be optimised for arena-style short conversations.
Leaderboard — Chatbot Arena ELO
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | Gemini 3.1 Pro Preview | Google | 1375 |
| 🥈 | GPT-5.2 | OpenAI | 1370 |
| 🥉 | Claude Opus 4.6 | Anthropic | 1365 |
| 4 | GPT-5 | OpenAI | 1355 |
| 5 | Claude Sonnet 4.6 | Anthropic | 1350 |
| 6 | Grok 4 | xAI | 1345 |
| 7 | Gemini 2.5 Pro Preview 06-05 | Google | 1340 |
| 8 | o3 | OpenAI | 1337 |
| 9 | Claude Opus 4 | Anthropic | 1330 |
| 10 | Grok 3 Beta | xAI | 1329 |
| 11 | Qwen3 235B A22B | Alibaba | 1320 |
| 12 | R1 | DeepSeek | 1318 |
| 13 | Claude Sonnet 4 | Anthropic | 1310 |
| 14 | DeepSeek V3 0324 | DeepSeek | 1310 |
| 15 | Gemini 2.5 Flash | Google | 1300 |
| 16 | Mistral Large | Mistral | 1295 |
| 17 | Llama 4 Maverick | Meta | 1290 |
| 18 | GPT-4o (2024-05-13) | OpenAI | 1285 |
| 19 | GPT-4.1 | OpenAI | 1283 |
| 20 | Command A | Cohere | 1280 |
| 21 | DeepSeek V3 | DeepSeek | 1275 |
| 22 | Gemini 2.0 Flash | Google | 1270 |
| 23 | Claude 3.5 Haiku | Anthropic | 1260 |
| 24 | Llama 3.3 70B Instruct | Meta | 1250 |
| 25 | Qwen2.5 72B Instruct | Alibaba | 1245 |
| 26 | GPT-4o-mini (2024-07-18) | OpenAI | 1240 |
| 27 | Mistral Small 3.1 24B | Mistral | 1235 |
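A rating gap in the table translates directly into an expected head-to-head win rate via the standard Elo logistic curve (400-point scale). This is a generic property of the rating system, not an Arena-published statistic; the function below shows the conversion:

```python
def win_probability(r_a: float, r_b: float) -> float:
    """Expected probability that the model rated r_a beats the model rated r_b,
    using the standard Elo logistic with a 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# The 5-point gap between the top two entries (1375 vs 1370) implies an
# expected win rate of roughly 50.7% — the models are nearly indistinguishable.
p = win_probability(1375, 1370)
```

By contrast, a 100-point gap implies roughly a 64% expected win rate, which is why even modest ELO differences on this leaderboard are meaningful over millions of votes.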