MT-Bench

Category: conversational

MT-Bench (Multi-Turn Benchmark) evaluates conversational ability through 80 carefully designed multi-turn questions across 8 categories: writing, roleplay, reasoning, math, coding, STEM, humanities, and extraction.

Models Tested: 9
Best Score: 9.5
Average Score: 9.0
Scale Range: 1–10
Weight: 0.8x

How It Works

Models engage in two-turn conversations (question → response → follow-up → response). GPT-4 scores each turn on a scale of 1-10. The benchmark tests whether models can maintain coherence and quality across multiple exchanges.
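
In code, one evaluation item looks roughly like the following. This is a minimal sketch, not the official FastChat harness: it assumes the `openai` Python package and an OpenAI-compatible endpoint, the model and judge names are illustrative stand-ins, and the judge prompt paraphrases the single-answer grading idea rather than reproducing the verbatim template.

```python
# Minimal sketch of one MT-Bench item: two user turns, two model answers,
# and a judge score for each turn. Model/judge names and the judge prompt
# are illustrative stand-ins, not the official templates.
import re

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # model under test (stand-in choice)
JUDGE = "gpt-4"   # MT-Bench uses GPT-4 as the judge

def ask(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

def judge_turn(question: str, answer: str) -> int | None:
    # Paraphrase of single-answer grading: one 1-10 score per turn,
    # extracted from a "Rating: [[N]]" pattern in the judge's reply.
    prompt = (
        "Rate the assistant's answer on a scale of 1 to 10.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        'End your reply with "Rating: [[N]]".'
    )
    resp = client.chat.completions.create(
        model=JUDGE, messages=[{"role": "user", "content": prompt}]
    )
    m = re.search(r"\[\[(\d+)\]\]", resp.choices[0].message.content)
    return int(m.group(1)) if m else None

# A writing-category style item: the follow-up only makes sense given turn 1.
turn1 = "Compose an engaging travel blog post about a recent trip to Hawaii."
turn2 = "Rewrite your previous response, starting every sentence with the letter A."

messages = [{"role": "user", "content": turn1}]
answer1 = ask(messages)
messages += [{"role": "assistant", "content": answer1},
             {"role": "user", "content": turn2}]
answer2 = ask(messages)

print([judge_turn(turn1, answer1), judge_turn(turn2, answer2)])
# A model's MT-Bench score is the average of such turn scores over all 80 items.
```

Note that the second call resends the full conversation history; maintaining quality once the first answer is in context is exactly what the benchmark probes.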

Why It Matters

Real conversations are multi-turn, but most benchmarks only test single-turn responses. MT-Bench specifically evaluates the ability to build on previous context, handle follow-ups, and maintain consistency.

Limitations

Only 80 questions limits statistical power. GPT-4 judging introduces bias; the MT-Bench authors themselves document verbosity and self-enhancement biases in LLM judges. Two turns is still short compared to real conversations. The 1-10 scale compresses differences, so most good models cluster around 8-9.
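
To make the statistical-power point concrete, the sketch below bootstraps a 95% confidence interval for a mean over 80 scores. The scores are synthetic, generated to cluster around 8-9 the way good models do here; they are not real MT-Bench data, and the point is only how wide the interval stays at this sample size.

```python
# Sketch: how much an 80-item average can wobble. Scores are synthetic,
# not real benchmark data.
import random
import statistics

random.seed(0)
scores = [min(10.0, max(1.0, random.gauss(8.8, 1.2))) for _ in range(80)]

# Bootstrap the mean: resample the 80 scores with replacement 10,000 times.
means = sorted(
    statistics.mean(random.choices(scores, k=len(scores)))
    for _ in range(10_000)
)
lo, hi = means[250], means[9_750]  # ~central 95% of bootstrap means
print(f"mean={statistics.mean(scores):.2f}  95% CI ≈ [{lo:.2f}, {hi:.2f}]")
# At n=80 the interval spans roughly ±0.25, wider than many of the
# score gaps on the leaderboard below.
```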

Leaderboard — MT-Bench

#    Model                          Provider   Score
🥇   GPT-5.2                        OpenAI     9.5
🥈   Claude Opus 4.6                Anthropic  9.4
🥉   Grok 4                         xAI        9.3
4    Gemini 2.5 Pro Preview 06-05   Google     9.2
5    o3                             OpenAI     9.1
6    R1                             DeepSeek   8.9
7    Claude Sonnet 4                Anthropic  8.8
8    GPT-4o (2024-05-13)            OpenAI     8.6
9    Llama 4 Maverick               Meta       8.5