SWE-bench Verified

Category: coding

SWE-bench Verified tests whether AI models can resolve real GitHub issues from popular open-source Python repositories. "Verified" denotes a 500-task, human-validated subset of SWE-bench: annotators confirmed that each issue is solvable and that its grading tests are correct.
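
The Verified tasks are published on Hugging Face, so the dataset can be inspected directly. A minimal sketch, assuming the `datasets` library and the public `princeton-nlp/SWE-bench_Verified` dataset:

```python
# Load the SWE-bench Verified split (500 human-validated instances) and
# inspect one task. Assumes the Hugging Face `datasets` library.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = ds[0]
print(task["repo"])               # source repository, e.g. "astropy/astropy"
print(task["base_commit"])        # repository state the model starts from
print(task["problem_statement"])  # the GitHub issue text given to the model
print(task["FAIL_TO_PASS"])       # tests the generated patch must make pass
```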


Models Tested: 10
Best Score: 78.0
Average Score: 69.0
Scale Range: 0–100
Weight: 1.5x

How It Works

The model receives a GitHub issue description and the state of the repository at the time the issue was filed. It must generate a code patch that resolves the issue. Success is measured by whether the patch makes the issue's previously failing tests pass without breaking the repository's existing test suite.
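
To make that grading loop concrete, here is a simplified sketch, assuming a local checkout of the task's repository, a `model_patch` string produced by the model, and the field names from the SWE-bench dataset. The official harness performs these steps inside Docker containers with pinned, per-repo environments:

```python
# Illustrative sketch of how one SWE-bench task is graded; not the official
# harness, which runs in isolated Docker containers with per-repo test
# commands. Field names follow the SWE-bench dataset schema.
import json
import subprocess

def evaluate_patch(repo_dir: str, task: dict, model_patch: str) -> bool:
    # Reset the working tree to the commit the model saw.
    subprocess.run(["git", "checkout", "-f", task["base_commit"]],
                   cwd=repo_dir, check=True)

    # Apply the model's proposed fix; failure to apply counts as unresolved.
    applied = subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                             input=model_patch, text=True)
    if applied.returncode != 0:
        return False

    # Apply the gold test patch, which adds the tests used for grading.
    subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                   input=task["test_patch"], text=True, check=True)

    # FAIL_TO_PASS: tests that fail before the fix and must pass after it.
    # PASS_TO_PASS: existing tests that must keep passing (no regressions).
    # Both are stored as JSON-encoded lists of test identifiers.
    tests = json.loads(task["FAIL_TO_PASS"]) + json.loads(task["PASS_TO_PASS"])

    # Most SWE-bench repos use pytest; a few need repo-specific runners.
    result = subprocess.run(["python", "-m", "pytest", *tests], cwd=repo_dir)
    return result.returncode == 0
```

A patch that fails to apply or breaks any PASS_TO_PASS test scores as unresolved, so partial fixes earn no credit.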

Why It Matters

SWE-bench is widely regarded as the gold standard for measuring real-world coding ability. Unlike HumanEval's isolated, single-function problems, SWE-bench requires understanding large codebases, diagnosing bugs, and writing patches that fit within an existing architecture.

Limitations

It only tests Python repositories, so results may not transfer to other languages. Setup and execution are computationally expensive. Some issues require domain-specific knowledge beyond pure coding ability.

Leaderboard — SWE-bench Verified

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-5.2 | OpenAI | 78.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 78.0 |
| 🥉 | GPT-5 | OpenAI | 75.0 |
| 4 | o3 Pro | OpenAI | 73.0 |
| 5 | Claude Opus 4 | Anthropic | 72.5 |
| 6 | Claude Sonnet 4.6 | Anthropic | 72.0 |
| 7 | o3 | OpenAI | 69.1 |
| 8 | Gemini 2.5 Pro Preview 06-05 | Google | 63.8 |
| 9 | GPT-4.1 | OpenAI | 54.6 |
| 10 | Claude Sonnet 4 | Anthropic | 53.6 |