SWE-bench Verified
SWE-bench Verified tests whether AI can resolve real GitHub issues from popular open-source Python repositories. "Verified" means human annotators confirmed each task is solvable and has correct test cases.
View paper / source

Models Tested: 10
Best Score: 78.0
Average Score: 69.0
Scale Range: 0–100
Weight: 1.5x
How It Works
The model receives a GitHub issue description and the repository state. It must generate a code patch that resolves the issue. Success is measured by whether the generated patch passes the repository's test suite.
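The loop above can be sketched in a few lines. This is a minimal toy, not the official harness: real SWE-bench evaluation applies unified diffs with `git apply` and runs the repository's actual test suite in an isolated environment, whereas here a "repository" is a dict of file contents, a "patch" is a dict of replacement files, and the issue's regression test is a single hard-coded check.

```python
# Toy repository snapshot: one module containing the reported bug
# (add subtracts instead of adding).
repo = {"calc.py": "def add(a, b):\n    return a - b\n"}

# Model-generated "patch", simplified to replacement file contents.
# (Real SWE-bench patches are unified diffs applied with git.)
patch = {"calc.py": "def add(a, b):\n    return a + b\n"}

def run_tests(files):
    """Exec the repo's files and run the issue's regression test."""
    ns = {}
    for path, source in files.items():
        exec(compile(source, path, "exec"), ns)
    return ns["add"](2, 3) == 5  # the failing-to-passing test

def evaluate(repo, patch):
    """Apply the patch to the snapshot, then re-run the tests."""
    patched = {**repo, **patch}
    return run_tests(patched)

assert not run_tests(repo)    # the issue reproduces pre-patch
assert evaluate(repo, patch)  # the patch resolves it
```

The pre-patch assertion mirrors an important detail of the benchmark: a task only counts as resolved if tests that failed before the patch pass afterward, while the rest of the suite keeps passing.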
Why It Matters
SWE-bench is the gold standard for real-world coding ability. Unlike HumanEval's isolated functions, SWE-bench requires understanding large codebases, diagnosing bugs, and writing patches that work within existing architecture.
Limitations
Only tests Python repositories. Setup and execution are computationally expensive. Some issues require domain-specific knowledge beyond pure coding ability.
Leaderboard — SWE-bench Verified
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 78.0 |
| 🥈 | Claude Opus 4.6 | Anthropic | 78.0 |
| 🥉 | GPT-5 | OpenAI | 75.0 |
| 4 | o3 Pro | OpenAI | 73.0 |
| 5 | Claude Opus 4 | Anthropic | 72.5 |
| 6 | Claude Sonnet 4.6 | Anthropic | 72.0 |
| 7 | o3 | OpenAI | 69.1 |
| 8 | Gemini 2.5 Pro Preview 06-05 | Google | 63.8 |
| 9 | GPT-4.1 | OpenAI | 54.6 |
| 10 | Claude Sonnet 4 | Anthropic | 53.6 |