BFCL

tool use

The Berkeley Function Calling Leaderboard (BFCL) tests how well models can invoke tools and functions correctly, covering simple calls, parallel calls, multi-turn conversations, and knowing when NOT to call a function.


Models Tested: 5
Best Score: 92.0
Average Score: 86.8
Scale Range: 0–100
Weight: 1x (see the aggregation sketch below)
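The Weight field indicates how much this benchmark counts toward an overall model index. As a minimal sketch, assuming the aggregate is a weighted mean of per-benchmark scores (the page does not specify the actual formula), the 1x weight would contribute as follows; the second benchmark name and its weight are hypothetical placeholders:

```python
# Sketch of weight-based aggregation, assuming a weighted mean.
# The site's actual aggregation formula is not specified on this page.
from typing import Dict, Tuple

def overall_score(results: Dict[str, Tuple[float, float]]) -> float:
    """results maps benchmark name -> (score on a 0-100 scale, weight)."""
    total_weight = sum(w for _, w in results.values())
    return sum(s * w for s, w in results.values()) / total_weight

# BFCL's 1x weight comes from this page; "OtherBench" and its 2x
# weight are made-up placeholders for illustration only.
print(overall_score({"BFCL": (92.0, 1.0), "OtherBench": (80.0, 2.0)}))
# -> 84.0
```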

How It Works

Models receive function definitions and user queries, then must generate correct function calls. Tests include simple single-function calls, parallel execution, nested calls, and relevance detection (avoiding unnecessary calls).
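As a sketch of the task format, a model is handed a JSON-schema-style function definition alongside a user query and must emit a structured call rather than free-form text. The schema and function name below are illustrative, not BFCL's exact test format:

```python
# Illustrative function-calling test case (format simplified;
# the benchmark's real JSON layout may differ).
function_def = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

query = "What's the weather in Paris in celsius?"

# A correct response is a structured call, not prose:
expected_call = {
    "name": "get_weather",
    "arguments": {"city": "Paris", "unit": "celsius"},
}

# Relevance detection: for a query no available function can answer
# ("Tell me a joke"), the correct behavior is to emit no call at all.
```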

Why It Matters

Function calling is critical for AI agents and tool use. BFCL is the most comprehensive benchmark for this capability, directly measuring whether models can reliably interact with external APIs and tools.

Limitations

Function calling formats vary between providers, which can affect scores. The benchmark grades only the generated call, not the execution of the underlying function, and its simulated environment may not capture all real-world edge cases.
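Because only the generated call is graded, scoring reduces to comparing the call's structure against a reference answer. Below is a minimal sketch of such a structural check; BFCL's actual AST-based matcher handles type coercion, optional parameters, and multiple acceptable answers, and is not reproduced here:

```python
# Minimal structural check of a generated call against a reference.
# A deliberate simplification of AST-style matching.
def call_matches(generated: dict, reference: dict) -> bool:
    if generated.get("name") != reference.get("name"):
        return False
    gen_args = generated.get("arguments", {})
    ref_args = reference.get("arguments", {})
    # Every reference argument must be present with the same value.
    return all(gen_args.get(k) == v for k, v in ref_args.items())

generated = {"name": "get_weather",
             "arguments": {"city": "Paris", "unit": "celsius"}}
reference = {"name": "get_weather",
             "arguments": {"city": "Paris", "unit": "celsius"}}
assert call_matches(generated, reference)
```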

Leaderboard — BFCL

Rank | Model | Provider | Score
1 | GPT-5.2 | OpenAI | 92.0
2 | Claude Sonnet 4 | Anthropic | 88.0
3 | Gemini 2.5 Pro Preview 06-05 | Google | 87.0
4 | GPT-4o | OpenAI | 85.0
5 | Grok 3 Beta | xAI | 82.0