MT-Bench

Category: conversational

MT-Bench (Multi-Turn Benchmark) evaluates conversational ability through 80 carefully designed multi-turn questions across 8 categories: writing, roleplay, reasoning, math, coding, STEM, humanities, and extraction.

Models Tested: 9
Best Score: 9.5
Average Score: 9.0
Scale Range: 1–10
Weight: 0.8x

How It Works

Models engage in two-turn conversations (question → response → follow-up → response). GPT-4 scores each turn on a scale of 1-10. The benchmark tests whether models can maintain coherence and quality across multiple exchanges.
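
In code, one evaluation item looks roughly like the following. This is a minimal sketch, not the official FastChat harness: it assumes the `openai` Python package and an OpenAI-compatible endpoint, the model and judge names are illustrative stand-ins, and the judge prompt paraphrases the single-answer grading idea rather than reproducing the verbatim template.

```python
# Minimal sketch of one MT-Bench item: two user turns, two model answers,
# and a judge score for each turn. Model/judge names and the judge prompt
# are illustrative stand-ins, not the official templates.
import re

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # model under test (stand-in choice)
JUDGE = "gpt-4"   # MT-Bench uses GPT-4 as the judge

def ask(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

def judge_turn(question: str, answer: str) -> int | None:
    # Paraphrase of single-answer grading: one 1-10 score per turn,
    # extracted from a "Rating: [[N]]" pattern in the judge's reply.
    prompt = (
        "Rate the assistant's answer on a scale of 1 to 10.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        'End your reply with "Rating: [[N]]".'
    )
    resp = client.chat.completions.create(
        model=JUDGE, messages=[{"role": "user", "content": prompt}]
    )
    m = re.search(r"\[\[(\d+)\]\]", resp.choices[0].message.content)
    return int(m.group(1)) if m else None

# A writing-category style item: the follow-up only makes sense given turn 1.
turn1 = "Compose an engaging travel blog post about a recent trip to Hawaii."
turn2 = "Rewrite your previous response, starting every sentence with the letter A."

messages = [{"role": "user", "content": turn1}]
answer1 = ask(messages)
messages += [{"role": "assistant", "content": answer1},
             {"role": "user", "content": turn2}]
answer2 = ask(messages)

print([judge_turn(turn1, answer1), judge_turn(turn2, answer2)])
# A model's MT-Bench score is the average of such turn scores over all 80 items.
```

Note that the second call resends the full conversation history; maintaining quality once the first answer is in context is exactly what the benchmark probes.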

Why It Matters

Real conversations are multi-turn, but most benchmarks only test single-turn responses. MT-Bench specifically evaluates the ability to build on previous context, handle follow-ups, and maintain consistency.

Limitations

Only 80 questions limits statistical power. GPT-4 judging introduces bias; the MT-Bench authors themselves document verbosity and self-enhancement biases in LLM judges. Two turns is still short compared to real conversations. The 1-10 scale compresses differences, so most good models cluster around 8-9.
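
To make the statistical-power point concrete, the sketch below bootstraps a 95% confidence interval for a mean over 80 scores. The scores are synthetic, generated to cluster around 8-9 the way good models do here; they are not real MT-Bench data, and the point is only how wide the interval stays at this sample size.

```python
# Sketch: how much an 80-item average can wobble. Scores are synthetic,
# not real benchmark data.
import random
import statistics

random.seed(0)
scores = [min(10.0, max(1.0, random.gauss(8.8, 1.2))) for _ in range(80)]

# Bootstrap the mean: resample the 80 scores with replacement 10,000 times.
means = sorted(
    statistics.mean(random.choices(scores, k=len(scores)))
    for _ in range(10_000)
)
lo, hi = means[250], means[9_750]  # ~central 95% of bootstrap means
print(f"mean={statistics.mean(scores):.2f}  95% CI ≈ [{lo:.2f}, {hi:.2f}]")
# At n=80 the interval spans roughly ±0.25, wider than many of the
# score gaps on the leaderboard below.
```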

Leaderboard — MT-Bench

#    Model                          Provider   Score
🥇   GPT-5.2                        OpenAI     9.5
🥈   Claude Opus 4.6                Anthropic  9.4
🥉   Grok 4                         xAI        9.3
4    Gemini 2.5 Pro Preview 06-05   Google     9.2
5    o3                             OpenAI     9.1
6    R1                             DeepSeek   8.9
7    Claude Sonnet 4                Anthropic  8.8
8    GPT-4o (2024-05-13)            OpenAI     8.6
9    Llama 4 Maverick               Meta       8.5