TAU-bench

Category: agent

TAU-bench evaluates Tool-Agent-User interaction quality in real-world domains, testing whether agents correctly chain tool calls, handle errors, and maintain conversation coherence across multi-step scenarios.
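A minimal sketch of the behaviour under test, assuming a hypothetical `call_model` policy and a toy tool set (none of these names come from TAU-bench itself): the agent chains tool calls, feeds tool errors back to the model, and only replies to the user once it has what it needs.

```python
# Toy tools; names and return shapes are illustrative only.
TOOLS = {
    "lookup_order": lambda order_id: {"status": "shipped", "eta": "2 days"},
    "cancel_order": lambda order_id: {"error": "already shipped"},
}

def call_model(messages):
    """Stand-in for an LLM call: a scripted policy just to make the sketch runnable."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool", "name": "lookup_order", "arguments": {"order_id": "A12"}}
    last = tool_results[-1]["content"]
    if last.get("error"):
        return {"type": "reply", "content": f"Sorry, I can't do that: {last['error']}."}
    return {"type": "tool", "name": "cancel_order", "arguments": {"order_id": "A12"}}

def run_agent(user_message, max_steps=8):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        action = call_model(messages)
        if action["type"] == "reply":              # final user-facing answer
            return action["content"]
        tool = TOOLS[action["name"]]
        try:
            result = tool(**action["arguments"])   # one step in the tool chain
        except Exception as exc:                   # surface failures instead of crashing
            result = {"error": str(exc)}
        messages.append({"role": "tool", "name": action["name"], "content": result})
    return "Step limit reached."

print(run_agent("Please cancel order A12"))
```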

Models Tested: 8
Best Score: 72.0
Average Score: 63.6
Scale Range: 0–100
Weight: 1x

How It Works

Agents interact with simulated users and tools across domains such as customer service, data analysis, and scheduling. Evaluation measures both task completion and interaction quality (for example, avoiding unnecessary tool calls).
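As a rough illustration of that scoring idea (an assumption about the shape of the metric, not TAU-bench's actual implementation), the sketch below gives full credit for reaching the goal state and docks points for tool calls beyond the minimum the task needs:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    tool_calls: list = field(default_factory=list)   # names of tools the agent invoked
    final_state: dict = field(default_factory=dict)  # e.g. database state after the run

def score_episode(episode, goal_state, required_tools, penalty=0.1):
    """Return a 0-100 score: full credit for reaching the goal state,
    minus a small penalty per tool call beyond the required set."""
    completed = episode.final_state == goal_state
    extra_calls = max(0, len(episode.tool_calls) - len(required_tools))
    raw = (1.0 if completed else 0.0) - penalty * extra_calls
    return round(max(0.0, raw) * 100, 1)

# Example: the task needed two tool calls; the agent made four.
ep = Episode(tool_calls=["lookup_order", "lookup_order", "lookup_user", "cancel_order"],
             final_state={"order_A12": "cancelled"})
print(score_episode(ep, goal_state={"order_A12": "cancelled"},
                    required_tools=["lookup_order", "cancel_order"]))  # -> 80.0
```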

Why It Matters

Real-world agent deployment requires not just tool proficiency but good interaction patterns. TAU-bench uniquely evaluates the full loop of user understanding, tool selection, execution, and response quality.

Limitations

Simulated users may not capture real human behaviour. Domain-specific scenarios may not generalise. Evaluating interaction quality is partly subjective.

Leaderboard — TAU-bench

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-5.2 | OpenAI | 72.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 70.0 |
| 🥉 | o3 | OpenAI | 68.0 |
| 4 | Grok 4 | xAI | 66.0 |
| 5 | Gemini 2.5 Pro Preview 06-05 | Google | 64.0 |
| 6 | Claude Opus 4 | Anthropic | 62.0 |
| 7 | R1 | DeepSeek | 55.0 |
| 8 | GPT-4o | OpenAI | 52.0 |
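The summary figures above can be reproduced directly from the leaderboard scores:

```python
# Quick check of the stat cards against the leaderboard.
scores = [72.0, 70.0, 68.0, 66.0, 64.0, 62.0, 55.0, 52.0]
print(max(scores))                          # Best Score    -> 72.0
print(round(sum(scores) / len(scores), 1))  # Average Score -> 63.6
```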