GAIA
GAIA (General AI Assistants) tests real-world assistant tasks requiring reasoning, web browsing, multimodality, and tool use. Questions are easy for humans (~92% accuracy) but very hard for AI (~15% for GPT-4 at launch).
- Models tested: 5
- Best score: 78.0
- Average score: 72.6
- Scale range: 0–100
- Weight: 1.3×
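The weight presumably scales GAIA's contribution to a composite leaderboard score. A minimal sketch of such a weighted mean, assuming hypothetical benchmark names and scores; only the 1.3 weight comes from the stats above:

```python
# Hypothetical benchmark weights; GAIA's 1.3 is from the stats above,
# the other entry is invented for illustration.
WEIGHTS = {"gaia": 1.3, "other_bench": 1.0}

def composite(scores: dict[str, float]) -> float:
    """Weighted mean of 0-100 benchmark scores: a weight-1.3 benchmark
    counts 30% more than a weight-1.0 benchmark."""
    total_weight = sum(WEIGHTS[name] for name in scores)
    return sum(WEIGHTS[name] * s for name, s in scores.items()) / total_weight

print(composite({"gaia": 78.0, "other_bench": 90.0}))  # ~83.2
```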
How It Works
Models receive tasks that require multiple steps: searching the web, reading documents, performing calculations, and synthesising information. Tasks are graded on final answer correctness.
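Grading compares the model's final answer string against a reference after light normalization. The sketch below illustrates that idea; the specific normalization rules here are assumptions, not the official GAIA scorer:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, drop articles and punctuation, strip thousands
    separators, and collapse whitespace so trivially different
    renderings of the same answer compare equal."""
    s = answer.strip().lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)       # drop articles
    s = re.sub(r"(?<=\d),(?=\d{3}\b)", "", s)   # "1,000" -> "1000"
    s = re.sub(r"[^\w\s.]", "", s)              # drop punctuation, keep decimals
    return re.sub(r"\s+", " ", s).strip()

def grade(prediction: str, gold: str) -> bool:
    """Binary pass/fail on the final answer only; intermediate
    steps are not scored."""
    return normalize(prediction) == normalize(gold)

print(grade("1,000 km", "1000 km"))              # True
print(grade("The Eiffel Tower", "eiffel tower")) # True
```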
Why It Matters
GAIA tests practical AI assistant capability rather than academic knowledge. The large gap between human and AI performance makes it an excellent discriminator for real-world usefulness.
Limitations
Requires tool access (web browsing, code execution), which varies by implementation. Some tasks may be time-sensitive. The gap between human and AI performance is closing rapidly, which erodes the benchmark's headroom as a discriminator.
Leaderboard — GAIA
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 78.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 75.0 |
| 🥉 | o3 | OpenAI | 72.0 |
| 4 | Grok 4 | xAI | 70.0 |
| 5 | Gemini 2.5 Pro Preview 06-05 | Google | 68.0 |