BFCL
The Berkeley Function Calling Leaderboard (BFCL) tests how well models can invoke tools and functions correctly, covering simple calls, parallel calls, multi-turn conversations, and knowing when NOT to call a function.
View paper / source

- Models Tested: 5
- Best Score: 92.0
- Average Score: 86.8
- Scale Range: 0–100
- Weight: 1x
How It Works
Models receive function definitions and user queries, then must generate correct function calls. Tests include simple single-function calls, parallel execution, nested calls, and relevance detection (avoiding unnecessary calls).
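To make the task concrete, here is a minimal sketch of the kind of check involved, assuming a JSON tool-call format. The `get_weather` schema, the `check_call` helper, and the validation rules are hypothetical illustrations, not the official BFCL harness.

```python
import json

# Hypothetical tool definition in common JSON Schema style (illustrative,
# not taken from the BFCL dataset).
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def check_call(generated: str, tool: dict) -> bool:
    """Shallow check of a model-generated call: valid JSON, correct tool
    name, all required arguments present, no unknown arguments."""
    try:
        call = json.loads(generated)
    except json.JSONDecodeError:
        return False
    if call.get("name") != tool["name"]:
        return False
    schema = tool["parameters"]
    args = call.get("arguments", {})
    if not all(k in args for k in schema.get("required", [])):
        return False
    return all(k in schema["properties"] for k in args)

# A correct call passes; one missing the required "city" argument fails.
print(check_call('{"name": "get_weather", "arguments": {"city": "Berkeley"}}', WEATHER_TOOL))  # True
print(check_call('{"name": "get_weather", "arguments": {"unit": "celsius"}}', WEATHER_TOOL))   # False
```

The real harness goes further, for example matching calls structurally rather than as raw strings, and scoring relevance cases where the correct behavior is to call nothing.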
Why It Matters
Function calling is critical for AI agents and tool use. BFCL is the most comprehensive benchmark for this capability, directly measuring whether models can reliably interact with external APIs and tools.
Limitations
Function calling formats vary between providers, which can affect scores. The benchmark tests only call generation, not execution of the functions themselves, and its simulated environment may not capture all real-world edge cases.
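As an illustration of the format-variance point, the sketch below declares the same hypothetical `get_weather` tool in two provider styles; the shapes are simplified from OpenAI's and Anthropic's publicly documented tool-definition formats and may lag behind current APIs.

```python
# Illustration of provider format variance: the same hypothetical tool
# declared in two styles. Simplified from public API docs; details may
# have changed since writing.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

# OpenAI-style: schema nested under "function" with a "parameters" key.
openai_style = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": schema,
    },
}

# Anthropic-style: flat object with an "input_schema" key.
anthropic_style = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": schema,
}
```

A harness that normalizes these formats imperfectly can penalize a model for the translation rather than for its actual tool use.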
Leaderboard — BFCL
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 92.0 |
| 🥈 | Claude Sonnet 4 | Anthropic | 88.0 |
| 🥉 | Gemini 2.5 Pro Preview 06-05 | Google | 87.0 |
| 4 | GPT-4o | OpenAI | 85.0 |
| 5 | Grok 3 Beta | xAI | 82.0 |