WebArena

WebArena evaluates autonomous web agents on realistic tasks in self-hosted environments that replicate e-commerce sites, social media platforms, GitLab instances, and content management systems.

Models Tested: 8
Best Score: 52.0
Average Score: 40.4
Scale Range: 0–100
Weight: 1.1x

How It Works

Models must navigate real web interfaces to complete tasks like "Find the cheapest laptop under $500 on the store" or "Create a new repository and add a README". Success is measured by whether the task's goal state is reached, not by which steps the agent takes along the way.
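This end-state scoring can be illustrated with a minimal sketch. The task schema and evaluator names below are hypothetical simplifications for illustration, not WebArena's actual configuration format:

```python
# Minimal sketch of how a WebArena-style harness might score a task.
# The task schema and evaluator names are hypothetical simplifications.

def check_completion(task: dict, final_state: dict) -> bool:
    """Return True if the agent's final state satisfies the task's goal."""
    kind = task["eval"]["kind"]
    if kind == "url_match":
        # e.g. the agent must end on the newly created repository's page
        return final_state["url"] == task["eval"]["expected_url"]
    if kind == "text_present":
        # e.g. the cheapest laptop's name must appear in the agent's answer
        return task["eval"]["expected_text"] in final_state["answer"]
    raise ValueError(f"unknown evaluator: {kind}")

# Hypothetical task: create a repository, then land on its page.
task = {
    "intent": "Create a new repository and add a README",
    "eval": {
        "kind": "url_match",
        "expected_url": "https://gitlab.example/agent/demo",
    },
}
final_state = {"url": "https://gitlab.example/agent/demo", "answer": ""}

print(check_completion(task, final_state))  # True: end state matches the goal
```

Because only the final state is checked, two agents taking very different click paths can both score full marks on the same task.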

Why It Matters

Web interaction is one of the most practical agent capabilities. WebArena tests whether AI can actually use the internet effectively — not just answer questions about it.

Limitations

Self-hosted environments may not perfectly replicate real websites. Tasks are predefined and may not capture the full complexity of web navigation. Setup is resource-intensive.

Leaderboard — WebArena

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-5.2 | OpenAI | 52.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 48.0 |
| 🥉 | o3 | OpenAI | 45.0 |
| 4 | Gemini 2.5 Pro Preview 06-05 | Google | 42.0 |
| 5 | Grok 4 | xAI | 40.0 |
| 6 | Claude Opus 4 | Anthropic | 38.0 |
| 7 | GPT-4o | OpenAI | 30.0 |
| 8 | R1 | DeepSeek | 28.0 |