GAIA

Category: agent

GAIA (General AI Assistants) tests real-world assistant tasks that require reasoning, web browsing, multimodality, and tool use. Questions are conceptually simple for humans (~92% accuracy) but very hard for AI: GPT-4 with plugins scored ~15% when the benchmark was released.

Models Tested: 5
Best Score: 78.0
Average Score: 72.6
Scale Range: 0–100
Weight: 1.3x

How It Works

Models receive tasks that require multiple steps: searching the web, reading documents, performing calculations, and synthesising information. Tasks are graded on final answer correctness.
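
The GAIA paper grades with a quasi-exact match on the final answer: numbers are compared as numbers, comma-separated lists element by element, and other strings after light normalization. Below is a minimal Python sketch of that idea; the specific normalization rules here are illustrative assumptions, not the benchmark's official scorer.

```python
import re
import string

def normalize_answer(ans: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for string comparison."""
    ans = ans.lower().strip()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", ans)

def answers_match(predicted: str, gold: str) -> bool:
    """Quasi-exact match in the spirit of GAIA's scoring (illustrative rules)."""
    # Numeric answers: compare as floats so "1,000" and "1000.0" agree.
    def to_float(s: str):
        try:
            return float(s.replace(",", ""))
        except ValueError:
            return None

    p_num, g_num = to_float(predicted), to_float(gold)
    if p_num is not None and g_num is not None:
        return p_num == g_num

    # Comma-separated lists: compare element by element.
    if "," in gold:
        p_items = [x.strip() for x in predicted.split(",")]
        g_items = [x.strip() for x in gold.split(",")]
        return len(p_items) == len(g_items) and all(
            answers_match(p, g) for p, g in zip(p_items, g_items)
        )

    return normalize_answer(predicted) == normalize_answer(gold)
```

Because only the final answer is checked, intermediate reasoning and tool calls can be arbitrarily messy without penalty, which keeps grading cheap and automatic.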

Why It Matters

GAIA tests practical AI assistant capability rather than academic knowledge. The large gap between human and AI performance makes it an excellent discriminator for real-world usefulness.

Limitations

Requires tool access (web browsing, code execution), which varies by implementation, so scores are not strictly comparable across harnesses. Some tasks may be time-sensitive. The gap between human and AI performance is closing rapidly.
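
A minimal sketch of the kind of agent loop a harness provides makes the variability concrete. Everything here is a hypothetical stand-in: `call_model` for the LLM, and `web_search` / `run_python` for whatever tools a given implementation exposes.

```python
# Minimal tool-use loop. The set of entries in TOOLS is exactly what differs
# between harnesses, and is one reason reported GAIA scores vary.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"<search results for {query!r}>",  # stub
    "run_python": lambda code: "<stdout of executed code>",         # stub
}

def call_model(transcript: list[str]) -> str:
    """Stand-in for an LLM call; returns 'TOOL name: arg' or 'FINAL: answer'."""
    return "FINAL: 42"  # stub

def solve(task: str, max_steps: int = 10) -> str:
    transcript = [task]
    for _ in range(max_steps):
        reply = call_model(transcript)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        name, _, arg = reply.removeprefix("TOOL").strip().partition(":")
        transcript.append(TOOLS[name.strip()](arg.strip()))
    return ""  # step budget exhausted without a final answer
```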

Leaderboard — GAIA

| #  | Model                        | Provider  | Score |
|----|------------------------------|-----------|-------|
| 🥇 | GPT-5.2                      | OpenAI    | 78.0  |
| 🥈 | Claude Opus 4.6              | Anthropic | 75.0  |
| 🥉 | o3                           | OpenAI    | 72.0  |
| 4  | Grok 4                       | xAI       | 70.0  |
| 5  | Gemini 2.5 Pro Preview 06-05 | Google    | 68.0  |