MT-Bench
MT-Bench (Multi-Turn Benchmark) evaluates conversational ability through 80 carefully designed multi-turn questions across 8 categories: writing, roleplay, reasoning, math, coding, STEM, humanities, and extraction.
- Models Tested: 9
- Best Score: 9.5
- Average Score: 9.0
- Scale Range: 0–10
- Weight: 0.8x
How It Works
Models engage in two-turn conversations (question → response → follow-up → response). GPT-4 scores each turn on a scale of 1-10. The benchmark tests whether models can maintain coherence and quality across multiple exchanges.
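The per-question flow is easy to sketch. Below is a minimal Python sketch of the grading loop; `query_model` (the model under test) and `query_judge` (the GPT-4 judge) are hypothetical stubs standing in for real API calls, and the judge-prompt wording is illustrative rather than the exact official template (the official harness lives in the FastChat repository).

```python
import re

# Illustrative judge prompt; the real MT-Bench template differs in wording
# but likewise asks for a 1-10 rating in a parseable "Rating: [[N]]" form.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the assistant's response "
    "on a scale of 1 to 10.\n"
    "Question: {question}\nResponse: {response}\n"
    'End your reply with "Rating: [[N]]".'
)

def query_model(history: list[dict]) -> str:
    """Hypothetical stub for the model under test; returns a canned reply."""
    return "A placeholder answer."

def query_judge(prompt: str) -> str:
    """Hypothetical stub for the GPT-4 judge; returns a canned verdict."""
    return "The answer is adequate. Rating: [[8]]"

def score_two_turns(question: str, followup: str) -> list[float]:
    """Run one MT-Bench item: two user turns, one judge score per turn."""
    history: list[dict] = []
    scores: list[float] = []
    for user_msg in (question, followup):
        history.append({"role": "user", "content": user_msg})
        reply = query_model(history)  # model answers with full history
        history.append({"role": "assistant", "content": reply})
        verdict = query_judge(
            JUDGE_TEMPLATE.format(question=user_msg, response=reply)
        )
        m = re.search(r"Rating: \[\[(\d+(?:\.\d+)?)\]\]", verdict)
        scores.append(float(m.group(1)) if m else float("nan"))
    return scores

print(score_two_turns("Explain TCP slow start.", "Now compare it to BBR."))
```

A model's overall MT-Bench score is then the mean of the judge's ratings across all turns of all 80 questions.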
Why It Matters
Real conversations are multi-turn, but most benchmarks only test single-turn responses. MT-Bench specifically evaluates the ability to build on previous context, handle follow-ups, and maintain consistency.
Limitations
With only 80 questions, statistical power is limited. GPT-4 judging introduces bias: LLM judges are known to favor verbose answers and responses that resemble their own style. A two-turn exchange is still short compared to real conversations. And because strong models all land near the top of the 1-10 scale, most good models cluster around 8-9.
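To make the statistical-power point concrete, here is a back-of-the-envelope check, assuming an illustrative (not measured) per-question standard deviation of 1.5 points:

```python
import math

n = 80    # number of MT-Bench questions
sd = 1.5  # assumed per-question score standard deviation (illustrative)

# Half-width of a normal-approximation 95% confidence interval on the mean.
half_width = 1.96 * sd / math.sqrt(n)
print(f"95% CI on the mean score: +/- {half_width:.2f}")  # ~ +/- 0.33
```

Under that assumption, an interval of roughly ±0.33 is wider than most gaps between adjacent models on the leaderboard below, so small score differences should be read with caution.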
Leaderboard — MT-Bench
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 9.5 |
| 🥈 | Claude Opus 4.6 | Anthropic | 9.4 |
| 🥉 | Grok 4 | xAI | 9.3 |
| 4 | Gemini 2.5 Pro Preview 06-05 | Google | 9.2 |
| 5 | o3 | OpenAI | 9.1 |
| 6 | R1 | DeepSeek | 8.9 |
| 7 | Claude Sonnet 4 | Anthropic | 8.8 |
| 8 | GPT-4o (2024-05-13) | OpenAI | 8.6 |
| 9 | Llama 4 Maverick | Meta | 8.5 |