TAU-bench

Category: agent

TAU-bench evaluates Tool-Agent-User interaction quality in real-world domains, testing whether agents correctly chain tool calls, handle errors, and maintain conversation coherence across multi-step scenarios.
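A minimal sketch of the behaviour under test, assuming a hypothetical `call_model` policy and a toy tool set (none of these names come from TAU-bench itself): the agent chains tool calls, feeds tool errors back to the model, and only replies to the user once it has what it needs.

```python
# Toy tools; names and return shapes are illustrative only.
TOOLS = {
    "lookup_order": lambda order_id: {"status": "shipped", "eta": "2 days"},
    "cancel_order": lambda order_id: {"error": "already shipped"},
}

def call_model(messages):
    """Stand-in for an LLM call: a scripted policy just to make the sketch runnable."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool", "name": "lookup_order", "arguments": {"order_id": "A12"}}
    last = tool_results[-1]["content"]
    if last.get("error"):
        return {"type": "reply", "content": f"Sorry, I can't do that: {last['error']}."}
    return {"type": "tool", "name": "cancel_order", "arguments": {"order_id": "A12"}}

def run_agent(user_message, max_steps=8):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        action = call_model(messages)
        if action["type"] == "reply":              # final user-facing answer
            return action["content"]
        tool = TOOLS[action["name"]]
        try:
            result = tool(**action["arguments"])   # one step in the tool chain
        except Exception as exc:                   # surface failures instead of crashing
            result = {"error": str(exc)}
        messages.append({"role": "tool", "name": action["name"], "content": result})
    return "Step limit reached."

print(run_agent("Please cancel order A12"))
```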

Models Tested: 8
Best Score: 72.0
Average Score: 63.6
Scale Range: 0–100
Weight: 1x

How It Works

Agents interact with simulated users and tools across domains such as customer service, data analysis, and scheduling. Evaluation measures both task completion and interaction quality (for example, avoiding unnecessary tool calls).
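As a rough illustration of that scoring idea (an assumption about the shape of the metric, not TAU-bench's actual implementation), the sketch below gives full credit for reaching the goal state and docks points for tool calls beyond the minimum the task needs:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    tool_calls: list = field(default_factory=list)   # names of tools the agent invoked
    final_state: dict = field(default_factory=dict)  # e.g. database state after the run

def score_episode(episode, goal_state, required_tools, penalty=0.1):
    """Return a 0-100 score: full credit for reaching the goal state,
    minus a small penalty per tool call beyond the required set."""
    completed = episode.final_state == goal_state
    extra_calls = max(0, len(episode.tool_calls) - len(required_tools))
    raw = (1.0 if completed else 0.0) - penalty * extra_calls
    return round(max(0.0, raw) * 100, 1)

# Example: the task needed two tool calls; the agent made four.
ep = Episode(tool_calls=["lookup_order", "lookup_order", "lookup_user", "cancel_order"],
             final_state={"order_A12": "cancelled"})
print(score_episode(ep, goal_state={"order_A12": "cancelled"},
                    required_tools=["lookup_order", "cancel_order"]))  # -> 80.0
```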

Why It Matters

Real-world agent deployment requires not just tool proficiency but good interaction patterns. TAU-bench uniquely evaluates the full loop of user understanding, tool selection, execution, and response quality.

Limitations

Simulated users may not capture real human behaviour. Domain-specific scenarios may not generalise. Evaluating interaction quality is partly subjective.

Leaderboard — TAU-bench

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-5.2 | OpenAI | 72.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 70.0 |
| 🥉 | o3 | OpenAI | 68.0 |
| 4 | Grok 4 | xAI | 66.0 |
| 5 | Gemini 2.5 Pro Preview 06-05 | Google | 64.0 |
| 6 | Claude Opus 4 | Anthropic | 62.0 |
| 7 | R1 | DeepSeek | 55.0 |
| 8 | GPT-4o | OpenAI | 52.0 |
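The summary figures above can be reproduced directly from the leaderboard scores:

```python
# Quick check of the stat cards against the leaderboard.
scores = [72.0, 70.0, 68.0, 66.0, 64.0, 62.0, 55.0, 52.0]
print(max(scores))                          # Best Score    -> 72.0
print(round(sum(scores) / len(scores), 1))  # Average Score -> 63.6
```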