Chatbot Arena Elo

conversational

Chatbot Arena (by LMSYS) is a crowdsourced evaluation where real users have blind conversations with two anonymous models and vote for which response they prefer. Results are compiled into an Elo rating system.

Models Tested: 27
Best Score: 1375.0
Average Score: 1306.0
Scale Range: 800–1400
Weight: 1.5x
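
The Scale Range and Weight figures hint at how this page might fold the benchmark into a cross-benchmark composite. As a minimal sketch, assuming scores are min-max normalised over the 800–1400 range and then multiplied by the 1.5x weight (the page does not state its formula, and `weighted_normalized_score` is a hypothetical name):

```python
# Hypothetical composite contribution: min-max normalise an Arena rating
# over the page's stated scale range, then apply the benchmark weight.
# The page does not document its actual aggregation formula.
SCALE_MIN, SCALE_MAX = 800.0, 1400.0
WEIGHT = 1.5

def weighted_normalized_score(rating: float) -> float:
    """Map a rating to [0, 1] over the scale range, then weight it."""
    clipped = min(max(rating, SCALE_MIN), SCALE_MAX)
    return WEIGHT * (clipped - SCALE_MIN) / (SCALE_MAX - SCALE_MIN)

print(weighted_normalized_score(1375.0))  # best score: 1.5 * 0.9583... ≈ 1.4375
```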

How It Works

Users interact with two anonymous models side by side and pick the better response. Votes are aggregated using the Bradley-Terry model (the statistical model underlying chess Elo ratings) to produce a ranking. Over 2 million human votes have been collected.
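
To make the aggregation concrete, below is a minimal sketch of fitting Bradley-Terry strengths to pairwise votes with the classic minorise-maximise iteration, then mapping them onto an Elo-style scale. The vote data, model names, iteration count, and 1000-point anchor are all illustrative assumptions; the real leaderboard fits the same model over millions of votes with a far more robust pipeline.

```python
import math
from collections import defaultdict

# Toy pairwise votes (winner, loser) -- illustrative only, not real Arena data.
votes = [
    ("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a"),
    ("model_a", "model_c"), ("model_a", "model_c"), ("model_c", "model_b"),
]

models = sorted({m for pair in votes for m in pair})
wins = defaultdict(float)   # total wins per model
games = defaultdict(float)  # games per unordered pair of models
for winner, loser in votes:
    wins[winner] += 1.0
    games[frozenset((winner, loser))] += 1.0

# Bradley-Terry strengths p_i via the minorise-maximise iteration:
#   p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
# This is the maximum-likelihood fit of P(i beats j) = p_i / (p_i + p_j).
p = {m: 1.0 for m in models}
for _ in range(200):
    p_new = {}
    for i in models:
        denom = sum(
            games[frozenset((i, j))] / (p[i] + p[j])
            for j in models
            if j != i and games[frozenset((i, j))] > 0
        )
        p_new[i] = wins[i] / denom if denom > 0 else p[i]
    # Strengths are scale-free; pin the geometric mean to 1.
    gmean = math.exp(sum(math.log(v) for v in p_new.values()) / len(p_new))
    p = {m: v / gmean for m, v in p_new.items()}

# Elo convention: a 400-point rating gap means 10:1 win odds, so
#   rating_i = anchor + (400 / ln 10) * ln(p_i).
# The anchor (1000 here) is arbitrary; only rating differences matter.
ratings = {m: 1000.0 + 400.0 / math.log(10.0) * math.log(p[m]) for m in models}

for m in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{m}: {ratings[m]:.0f}")
```

One property worth noting: unlike sequential chess-style Elo updates, the maximum-likelihood Bradley-Terry fit does not depend on the order in which votes arrive, which matters when votes trickle in from many users over months.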

Why It Matters

Chatbot Arena is widely considered the most reliable benchmark for overall model quality because it captures real human preferences across diverse, unconstrained conversations — not just narrow academic tasks.

Limitations

Ratings reflect crowd preferences, which may favour style over substance. Votes skew heavily toward English-language conversations, and the user base may not be representative of all use cases. Models can also be optimised for the short, arena-style exchanges the format rewards.

Leaderboard — Chatbot Arena Elo

#   Model                          Provider   Score
🥇  Gemini 3.1 Pro Preview         Google     1375
🥈  GPT-5.2                        OpenAI     1370
🥉  Claude Opus 4.6                Anthropic  1365
4   GPT-5                          OpenAI     1355
5   Claude Sonnet 4.6              Anthropic  1350
6   Grok 4                         xAI        1345
7   Gemini 2.5 Pro Preview 06-05   Google     1340
8   o3                             OpenAI     1337
9   Claude Opus 4                  Anthropic  1330
10  Grok 3 Beta                    xAI        1329
11  Qwen3 235B A22B                Alibaba    1320
12  R1                             DeepSeek   1318
13  Claude Sonnet 4                Anthropic  1310
14  DeepSeek V3 0324               DeepSeek   1310
15  Gemini 2.5 Flash               Google     1300
16  Mistral Large                  Mistral    1295
17  Llama 4 Maverick               Meta       1290
18  GPT-4o (2024-05-13)            OpenAI     1285
19  GPT-4.1                        OpenAI     1283
20  Command A                      Cohere     1280
21  DeepSeek V3                    DeepSeek   1275
22  Gemini 2.0 Flash               Google     1270
23  Claude 3.5 Haiku               Anthropic  1260
24  Llama 3.3 70B Instruct         Meta       1250
25  Qwen2.5 72B Instruct           Alibaba    1245
26  GPT-4o-mini (2024-07-18)       OpenAI     1240
27  Mistral Small 3.1 24B          Mistral    1235