BFCL
The Berkeley Function Calling Leaderboard (BFCL) tests how well models can invoke tools and functions correctly, covering simple calls, parallel calls, multi-turn conversations, and knowing when NOT to call a function.
View paper / source

- Models Tested: 5
- Best Score: 92.0
- Average Score: 86.8
- Scale Range: 0–100
- Weight: 1x
How It Works
Models receive function definitions and user queries, then must generate correct function calls. Tests include simple single-function calls, parallel execution, nested calls, and relevance detection (avoiding unnecessary calls).
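To make the task concrete, here is a minimal sketch of the kind of check involved, assuming a JSON tool-call format. The `get_weather` schema, the `check_call` helper, and the validation rules are hypothetical illustrations, not the official BFCL harness.

```python
import json

# Hypothetical tool definition in common JSON Schema style (illustrative,
# not taken from the BFCL dataset).
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def check_call(generated: str, tool: dict) -> bool:
    """Shallow check of a model-generated call: valid JSON, correct tool
    name, all required arguments present, no unknown arguments."""
    try:
        call = json.loads(generated)
    except json.JSONDecodeError:
        return False
    if call.get("name") != tool["name"]:
        return False
    schema = tool["parameters"]
    args = call.get("arguments", {})
    if not all(k in args for k in schema.get("required", [])):
        return False
    return all(k in schema["properties"] for k in args)

# A correct call passes; one missing the required "city" argument fails.
print(check_call('{"name": "get_weather", "arguments": {"city": "Berkeley"}}', WEATHER_TOOL))  # True
print(check_call('{"name": "get_weather", "arguments": {"unit": "celsius"}}', WEATHER_TOOL))   # False
```

The real harness goes further, for example matching calls structurally rather than as raw strings, and scoring relevance cases where the correct behavior is to call nothing.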
Why It Matters
Function calling is critical for AI agents and tool use. BFCL is the most comprehensive benchmark for this capability, directly measuring whether models can reliably interact with external APIs and tools.
Limitations
Function calling formats vary between providers, which can affect scores. The benchmark tests only call generation, not execution of the functions themselves, and its simulated environment may not capture all real-world edge cases.
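As an illustration of the format-variance point, the sketch below declares the same hypothetical `get_weather` tool in two provider styles; the shapes are simplified from OpenAI's and Anthropic's publicly documented tool-definition formats and may lag behind current APIs.

```python
# Illustration of provider format variance: the same hypothetical tool
# declared in two styles. Simplified from public API docs; details may
# have changed since writing.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

# OpenAI-style: schema nested under "function" with a "parameters" key.
openai_style = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": schema,
    },
}

# Anthropic-style: flat object with an "input_schema" key.
anthropic_style = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": schema,
}
```

A harness that normalizes these formats imperfectly can penalize a model for the translation rather than for its actual tool use.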
Leaderboard — BFCL
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 92.0 |
| 🥈 | Claude Sonnet 4 | Anthropic | 88.0 |
| 🥉 | Gemini 2.5 Pro Preview 06-05 | Google | 87.0 |
| 4 | GPT-4o | OpenAI | 85.0 |
| 5 | Grok 3 Beta | xAI | 82.0 |