FinQA
Domain: Financial question answering over earnings reports — numerical reasoning on real SEC filings
View paper / source

- Models tested: 10
- Best score: 85.0
- Average score: 77.7
- Scale range: 0–100
- Weight: 0.8x
How It Works
Models answer questions grounded in the tables and text of real company filings, with each question requiring multi-step arithmetic over the reported figures. Scoring follows the benchmark's standardised protocol: a model's final numeric answer is checked against the reference answer.
Why It Matters
Financial analysis demands precise multi-step arithmetic over dense tables and prose, where a small numerical slip invalidates the answer. FinQA isolates that skill on real filings, giving a standardised basis for comparing models on exact numerical reasoning rather than fluent approximation.
Limitations
FinQA covers a single domain and question style, and high scores may reflect familiarity with SEC-filing formats rather than general numerical reasoning. As with any benchmark, results should be weighed alongside other evaluations.
Leaderboard — FinQA
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 85.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 83.0 |
| 🥉 | o3 | OpenAI | 82.0 |
| 4 | Gemini 2.5 Pro Preview 06-05 | Google | 80.0 |
| 5 | Grok 4 | xAI | 79.0 |
| 6 | Claude Opus 4 | Anthropic | 78.0 |
| 7 | R1 | DeepSeek | 76.0 |
| 8 | Claude Sonnet 4 | Anthropic | 74.0 |
| 9 | GPT-4o | OpenAI | 72.0 |
| 10 | Llama 4 Maverick | Meta | 68.0 |
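The summary statistics above follow directly from the leaderboard scores; a minimal sketch that recomputes them (how the 0.8x weight feeds into any composite index is an assumption — the aggregation formula is not specified on this page):

```python
# Leaderboard scores, copied from the FinQA table above.
scores = {
    "GPT-5.2": 85.0,
    "Claude Opus 4.6": 83.0,
    "o3": 82.0,
    "Gemini 2.5 Pro Preview 06-05": 80.0,
    "Grok 4": 79.0,
    "Claude Opus 4": 78.0,
    "R1": 76.0,
    "Claude Sonnet 4": 74.0,
    "GPT-4o": 72.0,
    "Llama 4 Maverick": 68.0,
}

best = max(scores.values())                   # Best score: 85.0
average = sum(scores.values()) / len(scores)  # Average score: 77.7

# Hypothetical: if the 0.8x weight scales this benchmark's contribution
# to a composite index, each model's weighted contribution would be:
weighted = {model: 0.8 * s for model, s in scores.items()}

print(best, round(average, 1))  # → 85.0 77.7
```

Recomputing the cards from the table is a quick sanity check that the page's summary numbers and leaderboard rows are consistent.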