SWE-bench Verified

coding

SWE-bench Verified tests whether an AI model can resolve real GitHub issues from popular open-source Python repositories. It is a human-validated subset of SWE-bench: "Verified" means that human annotators confirmed each task is solvable and that its test cases correctly check the fix.

Models Tested: 10

Best Score: 78.0

Average Score: 69.0

Scale Range: 0–100

Weight: 1.5x

How It Works

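The model receives a GitHub issue description and the repository state at the time the issue was filed. It must generate a code patch that resolves the issue. A task counts as resolved when, with the patch applied, the tests that previously failed because of the issue now pass (FAIL_TO_PASS) and the tests that already passed still pass (PASS_TO_PASS). The score is the percentage of tasks resolved.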
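The pass/fail criterion can be sketched in a few lines of Python. This is a simplified illustration, not the official evaluation harness: the `TaskResult` container, the function names, and the test names are hypothetical, and it assumes each task's outcome has already been reduced to per-test booleans for the two test groups that SWE-bench calls FAIL_TO_PASS and PASS_TO_PASS.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after applying the model's patch (hypothetical container)."""
    fail_to_pass: dict[str, bool]  # tests that failed before the fix and should now pass
    pass_to_pass: dict[str, bool]  # tests that passed before and must keep passing


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every listed test passes."""
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_percentage(results: list[TaskResult]) -> float:
    """Share of resolved tasks on a 0-100 scale."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Illustrative data only; the test names are made up.
results = [
    TaskResult({"test_bug_fixed": True}, {"test_existing": True}),   # resolved
    TaskResult({"test_bug_fixed": True}, {"test_existing": False}),  # patch broke a passing test
    TaskResult({"test_bug_fixed": False}, {"test_existing": True}),  # issue not fixed
    TaskResult({"test_bug_fixed": True}, {"test_existing": True}),   # resolved
]
print(resolved_percentage(results))
```

Note that a patch that fixes the issue but breaks an existing test does not count, which is why the second task above is scored as unresolved.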

Why It Matters

SWE-bench is widely used as a measure of real-world coding ability. Unlike HumanEval, which tests isolated functions, SWE-bench requires understanding large codebases, diagnosing bugs, and writing patches that work within the existing architecture.

Limitations

It tests only Python repositories. Setup and execution are computationally expensive. Some issues require domain-specific knowledge beyond pure coding ability.

Leaderboard — SWE-bench Verified

#	Model	Provider	Score	Source	Measured
🥇	GPT-5.2	OpenAI	78.0	OpenAI	Dec 2025
🥈	Claude Opus 4.6	Anthropic	78.0	Anthropic	Feb 2026
🥉	GPT-5	OpenAI	75.0	OpenAI	Aug 2025
4	o3 Pro	OpenAI	73.0	OpenAI	Jun 2025
5	Claude Opus 4	Anthropic	72.5	Anthropic	May 2025
6	Claude Sonnet 4.6	Anthropic	72.0	Anthropic	Feb 2026
7	o3	OpenAI	69.1	OpenAI	Apr 2025
8	Gemini 2.5 Pro Preview 06-05	Google	63.8	Google	Jun 2025
9	GPT-4.1	OpenAI	54.6	OpenAI	Apr 2025
10	Claude Sonnet 4	Anthropic	53.6	Anthropic	May 2025
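
As a quick consistency check, the summary figures can be recomputed from the rows above. The sketch below assumes that Best Score is the maximum of the listed scores and that Average Score is their unweighted mean rounded to one decimal place; the page does not state how it derives these values, so treat this as one plausible reading.

```python
from statistics import mean

# Scores copied from the leaderboard table above.
scores = {
    "GPT-5.2": 78.0,
    "Claude Opus 4.6": 78.0,
    "GPT-5": 75.0,
    "o3 Pro": 73.0,
    "Claude Opus 4": 72.5,
    "Claude Sonnet 4.6": 72.0,
    "o3": 69.1,
    "Gemini 2.5 Pro Preview 06-05": 63.8,
    "GPT-4.1": 54.6,
    "Claude Sonnet 4": 53.6,
}

best = max(scores.values())
# Assumption: a plain, unweighted mean over the listed models.
average = round(mean(list(scores.values())), 1)

print(f"Best: {best}, Average: {average}")
```

Under these assumptions the ten listed scores sum to 689.6, giving a mean of 68.96, which rounds to the 69.0 shown as the Average Score.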
