WebArena

WebArena evaluates autonomous web agents on realistic tasks in self-hosted environments that replicate e-commerce sites, social media platforms, GitLab instances, and content management systems.

Models Tested: 8
Best Score: 52.0
Average Score: 40.4
Scale Range: 0–100
Weight: 1.1x

How It Works

Models must navigate real web interfaces to complete tasks like "Find the cheapest laptop under $500 on the store" or "Create a new repository and add a README". Success is measured by whether the task's goal state is reached, not by which steps the agent takes along the way.
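This end-state scoring can be illustrated with a minimal sketch. The task schema and evaluator names below are hypothetical simplifications for illustration, not WebArena's actual configuration format:

```python
# Minimal sketch of how a WebArena-style harness might score a task.
# The task schema and evaluator names are hypothetical simplifications.

def check_completion(task: dict, final_state: dict) -> bool:
    """Return True if the agent's final state satisfies the task's goal."""
    kind = task["eval"]["kind"]
    if kind == "url_match":
        # e.g. the agent must end on the newly created repository's page
        return final_state["url"] == task["eval"]["expected_url"]
    if kind == "text_present":
        # e.g. the cheapest laptop's name must appear in the agent's answer
        return task["eval"]["expected_text"] in final_state["answer"]
    raise ValueError(f"unknown evaluator: {kind}")

# Hypothetical task: create a repository, then land on its page.
task = {
    "intent": "Create a new repository and add a README",
    "eval": {
        "kind": "url_match",
        "expected_url": "https://gitlab.example/agent/demo",
    },
}
final_state = {"url": "https://gitlab.example/agent/demo", "answer": ""}

print(check_completion(task, final_state))  # True: end state matches the goal
```

Because only the final state is checked, two agents taking very different click paths can both score full marks on the same task.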

Why It Matters

Web interaction is one of the most practical agent capabilities. WebArena tests whether AI can actually use the internet effectively — not just answer questions about it.

Limitations

Self-hosted environments may not perfectly replicate real websites. Tasks are predefined and may not capture the full complexity of web navigation. Setup is resource-intensive.

Leaderboard — WebArena

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-5.2 | OpenAI | 52.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 48.0 |
| 🥉 | o3 | OpenAI | 45.0 |
| 4 | Gemini 2.5 Pro Preview 06-05 | Google | 42.0 |
| 5 | Grok 4 | xAI | 40.0 |
| 6 | Claude Opus 4 | Anthropic | 38.0 |
| 7 | GPT-4o | OpenAI | 30.0 |
| 8 | R1 | DeepSeek | 28.0 |