WebArena
Category: agent
WebArena tests autonomous web agents on realistic tasks in self-hosted environments: e-commerce sites, social media platforms, GitLab instances, and content management systems.
Models Tested: 8
Best Score: 52.0
Average Score: 40.4
Scale Range: 0–100
Weight: 1.1x
How It Works
Models must navigate real web interfaces to complete tasks like "Find the cheapest laptop under $500 on the store" or "Create a new repository and add a README". Success is measured by task completion.
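As a rough illustration of scoring by task completion, the sketch below computes a success rate on a 0–100 scale from per-task pass/fail outcomes. It is only a minimal sketch under stated assumptions: the `TaskResult` type, the task identifiers, and the outcomes are hypothetical, and the benchmark's actual evaluators may decide completion differently.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str     # hypothetical identifier, not taken from the benchmark
    completed: bool  # True if the agent achieved the task's goal


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of completed tasks on a 0-100 scale."""
    if not results:
        raise ValueError("no results to score")
    completed = sum(r.completed for r in results)
    return 100.0 * completed / len(results)


# Made-up outcomes for four illustrative tasks.
results = [
    TaskResult("shop-cheapest-laptop", True),
    TaskResult("gitlab-create-repo", False),
    TaskResult("cms-edit-page", True),
    TaskResult("forum-post-reply", False),
]

print(f"{success_rate(results):.1f}")  # prints 50.0
```

Each task contributes equally here, so partially completed tasks earn no credit; that is a simplifying assumption of this sketch.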
Why It Matters
Web interaction is one of the most practical agent capabilities. WebArena tests whether an AI system can actually operate websites to get things done, rather than just answer questions about them.
Limitations
Self-hosted environments may not perfectly replicate real websites. Tasks are predefined and may not capture the full complexity of web navigation. Setup is resource-intensive.