GAIA

Category: agent

GAIA (General AI Assistants) tests real-world assistant tasks that require reasoning, web browsing, multimodality, and tool use. Questions are conceptually simple for humans (~92% accuracy) but very hard for AI: GPT-4 with plugins scored ~15% when the benchmark was released.

Models Tested: 5
Best Score: 78.0
Average Score: 72.6
Scale Range: 0–100
Weight: 1.3x

How It Works

Models receive tasks that require multiple steps: searching the web, reading documents, performing calculations, and synthesising information. Tasks are graded on final answer correctness.
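
The GAIA paper grades with a quasi-exact match on the final answer: numbers are compared as numbers, comma-separated lists element by element, and other strings after light normalization. Below is a minimal Python sketch of that idea; the specific normalization rules here are illustrative assumptions, not the benchmark's official scorer.

```python
import re
import string

def normalize_answer(ans: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for string comparison."""
    ans = ans.lower().strip()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", ans)

def answers_match(predicted: str, gold: str) -> bool:
    """Quasi-exact match in the spirit of GAIA's scoring (illustrative rules)."""
    # Numeric answers: compare as floats so "1,000" and "1000.0" agree.
    def to_float(s: str):
        try:
            return float(s.replace(",", ""))
        except ValueError:
            return None

    p_num, g_num = to_float(predicted), to_float(gold)
    if p_num is not None and g_num is not None:
        return p_num == g_num

    # Comma-separated lists: compare element by element.
    if "," in gold:
        p_items = [x.strip() for x in predicted.split(",")]
        g_items = [x.strip() for x in gold.split(",")]
        return len(p_items) == len(g_items) and all(
            answers_match(p, g) for p, g in zip(p_items, g_items)
        )

    return normalize_answer(predicted) == normalize_answer(gold)
```

Because only the final answer is checked, intermediate reasoning and tool calls can be arbitrarily messy without penalty, which keeps grading cheap and automatic.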

Why It Matters

GAIA tests practical AI assistant capability rather than academic knowledge. The large gap between human and AI performance makes it an excellent discriminator for real-world usefulness.

Limitations

Requires tool access (web browsing, code execution), which varies by implementation, so scores are not strictly comparable across harnesses. Some tasks may be time-sensitive. The gap between human and AI performance is closing rapidly.
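
A minimal sketch of the kind of agent loop a harness provides makes the variability concrete. Everything here is a hypothetical stand-in: `call_model` for the LLM, and `web_search` / `run_python` for whatever tools a given implementation exposes.

```python
# Minimal tool-use loop. The set of entries in TOOLS is exactly what differs
# between harnesses, and is one reason reported GAIA scores vary.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"<search results for {query!r}>",  # stub
    "run_python": lambda code: "<stdout of executed code>",         # stub
}

def call_model(transcript: list[str]) -> str:
    """Stand-in for an LLM call; returns 'TOOL name: arg' or 'FINAL: answer'."""
    return "FINAL: 42"  # stub

def solve(task: str, max_steps: int = 10) -> str:
    transcript = [task]
    for _ in range(max_steps):
        reply = call_model(transcript)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        name, _, arg = reply.removeprefix("TOOL").strip().partition(":")
        transcript.append(TOOLS[name.strip()](arg.strip()))
    return ""  # step budget exhausted without a final answer
```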

Leaderboard — GAIA

| #  | Model                        | Provider  | Score |
|----|------------------------------|-----------|-------|
| 🥇 | GPT-5.2                      | OpenAI    | 78.0  |
| 🥈 | Claude Opus 4.6              | Anthropic | 75.0  |
| 🥉 | o3                           | OpenAI    | 72.0  |
| 4  | Grok 4                       | xAI       | 70.0  |
| 5  | Gemini 2.5 Pro Preview 06-05 | Google    | 68.0  |