TAU-bench
TAU-bench evaluates Tool-Agent-User interaction quality in real-world domains, testing whether agents correctly chain tool calls, handle errors, and maintain conversation coherence across multi-step scenarios.
Models Tested: 8
Best Score: 72.0
Average Score: 63.6
Scale Range: 0–100
Weight: 1x
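For a concrete sense of how these numbers might interact, here is a minimal sketch that normalises a raw score to the 0–100 scale range and applies the listed weight. The aggregation formula is an assumption for illustration only; how the leaderboard actually combines benchmarks is not specified in this section.

```python
def weighted_contribution(score: float, scale_min: float = 0.0,
                          scale_max: float = 100.0, weight: float = 1.0) -> float:
    """Normalise a raw benchmark score to [0, 1] on its scale range,
    then apply the benchmark's weight (assumed aggregation, for illustration)."""
    return weight * (score - scale_min) / (scale_max - scale_min)

print(weighted_contribution(72.0))  # best score    -> 0.72
print(weighted_contribution(63.6))  # average score -> 0.636
```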
How It Works
Agents interact with simulated users and tools across domains like customer service, data analysis, and scheduling. Evaluation measures both task completion and interaction quality (e.g., avoiding unnecessary tool calls).
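To make the loop concrete, below is a minimal sketch of a tool-agent-user episode in the spirit described above. Everything here (the TOOLS registry, run_episode, the scripted agent and user) is a hypothetical stand-in, not the actual TAU-bench harness, which adds domain databases, policy constraints, and richer user simulation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool registry (name -> callable). A real harness backs
# tools with domain databases (orders, flights, calendars, ...).
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
    "cancel_order": lambda order_id: f"order {order_id}: cancelled",
}

@dataclass
class EpisodeResult:
    task_completed: bool
    tool_calls: int
    unnecessary_calls: int  # tool calls outside the ground-truth solution

def run_episode(agent_step, user_reply, goal_calls: set[str],
                max_turns: int = 20) -> EpisodeResult:
    """Drive one tool-agent-user loop: each turn the agent either calls a
    tool or replies to the simulated user, until the user is satisfied."""
    transcript = [("user", user_reply(None))]  # opening user request
    made_calls: list[str] = []
    done = False
    for _ in range(max_turns):
        action = agent_step(transcript)  # ("tool", name, arg) or ("say", text)
        if action[0] == "tool":
            _, name, arg = action
            made_calls.append(name)
            transcript.append(("tool", TOOLS[name](arg)))
        else:
            transcript.append(("agent", action[1]))
            reply = user_reply(action[1])
            if reply is None:  # simulated user is satisfied
                done = True
                break
            transcript.append(("user", reply))
    return EpisodeResult(
        task_completed=done and goal_calls <= set(made_calls),
        tool_calls=len(made_calls),
        unnecessary_calls=len([c for c in made_calls if c not in goal_calls]),
    )

# Scripted stand-ins so the sketch runs end to end.
def scripted_agent(transcript):
    tool_results = [msg for role, msg in transcript if role == "tool"]
    if not tool_results:
        return ("tool", "cancel_order", "A17")
    return ("say", "Your order A17 has been cancelled.")

replies = iter(["Please cancel order A17.", None])
result = run_episode(scripted_agent, lambda _: next(replies), {"cancel_order"})
print(result)  # EpisodeResult(task_completed=True, tool_calls=1, unnecessary_calls=0)
```

The returned EpisodeResult mirrors the two evaluation axes above: whether the task was completed, and whether the agent made tool calls outside the ground-truth solution.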
Why It Matters
Real-world agent deployment requires not just tool proficiency but good interaction patterns. TAU-bench uniquely evaluates the full loop of user understanding, tool selection, execution, and response quality.
Limitations
Simulated users may not capture real human behaviour, domain-specific scenarios may not generalise, and evaluating interaction quality is partly subjective.