SWE-bench Verified
Category: coding
SWE-bench Verified tests whether an AI model can resolve real GitHub issues from popular open-source Python repositories. "Verified" means human annotators confirmed that each task is solvable and has correct test cases.
Models Tested: 24
Best Score: 80.0
Average Score: 63.4
Scale Range: 0–100
Weight: 1.5x
How It Works
The model receives a GitHub issue description and the repository state. It must generate a code patch that resolves the issue. Success is measured by whether the repository's test suite passes once the generated patch is applied.
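The pass/fail criterion described above can be sketched in a few lines. This is a minimal illustration, not the official harness: the `TaskResult` structure, the split of tests into "should start passing" and "must keep passing" groups, and the reading of the 0–100 score as the percentage of resolved tasks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after the model's patch is applied.

    Hypothetical structure for illustration only.
    """
    fail_to_pass: dict[str, bool]  # tests the fix is expected to turn green
    pass_to_pass: dict[str, bool]  # tests that must stay green


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every tracked test passes."""
    # Require at least one target test so an empty result is never "resolved".
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def score(results: list[TaskResult]) -> float:
    """Percentage of resolved tasks on a 0-100 scale (assumed scoring rule)."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


results = [
    TaskResult({"test_fix": True}, {"test_old": True}),    # resolved
    TaskResult({"test_fix": True}, {"test_old": True}),    # resolved
    TaskResult({"test_fix": True}, {"test_old": False}),   # patch broke existing behavior
    TaskResult({"test_fix": False}, {"test_old": True}),   # issue not fixed
]
print(score(results))
```

The third task shows why existing tests matter in this sketch: a patch that fixes the issue but breaks other behavior does not count.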
Why It Matters
SWE-bench is widely treated as a leading benchmark for real-world coding ability. Unlike HumanEval, which tests isolated functions, SWE-bench requires understanding large codebases, diagnosing bugs, and writing patches that work within the existing architecture.
Limitations
It only tests Python repositories. Setup and execution are computationally expensive. Some issues require domain-specific knowledge beyond pure coding ability.
Leaderboard — SWE-bench Verified
| # | Model | Provider | Score | |
|---|---|---|---|---|
| 🥇 | GPT-5.2 Pro | OpenAI | 80.0 | |
| 🥈 | GPT-5.2 | OpenAI | 78.0 | |
| 🥉 | Claude Opus 4.6 | Anthropic | 78.0 | |
| 4 | GPT-5 Pro | OpenAI | 76.5 | |
| 5 | GPT-5 | OpenAI | 75.0 | |
| 6 | Claude Opus 4.5 | Anthropic | 75.0 | |
| 7 | o3 Pro | OpenAI | 73.0 | |
| 8 | Claude Opus 4 | Anthropic | 72.5 | |
| 9 | Claude Sonnet 4.6 | Anthropic | 72.0 | |
| 10 | Gemini 3 Pro | Google | 70.0 | |
| 11 | o3 | OpenAI | 69.1 | |
| 12 | Claude Sonnet 4.5 | Anthropic | 68.0 | |
| 13 | Gemini 2.5 Pro Preview 06-05 | Google | 63.8 | |
| 14 | Claude 3.7 Sonnet | Anthropic | 62.3 | |
| 15 | R1 0528 | DeepSeek | 57.6 | |
| 16 | Gemini 3 Flash Preview | Google | 55.0 | |
| 17 | GPT-4.1 | OpenAI | 54.6 | |
| 18 | Claude Sonnet 4 | Anthropic | 53.6 | |
| 19 | Claude Haiku 4.5 | Anthropic | 52.0 | |
| 20 | DeepSeek V3 0324 | DeepSeek | 50.0 | |
| 21 | Claude 3.5 Sonnet | Anthropic | 49.0 | |
| 22 | o1 | OpenAI | 48.9 | |
| 23 | Mistral Large | Mistral | 45.0 | |
| 24 | GPT-4.1 Mini | OpenAI | 42.0 | |
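The summary figures reported for this benchmark (24 models tested, best score 80.0, average score 63.4) can be checked against the table. The short script below is only a sanity check; the `scores` list is copied by hand from the Score column, and rounding the mean to one decimal place is an assumption about how the page displays it.

```python
from statistics import mean

# Scores copied from the leaderboard table, in rank order.
scores = [
    80.0, 78.0, 78.0, 76.5, 75.0, 75.0, 73.0, 72.5,
    72.0, 70.0, 69.1, 68.0, 63.8, 62.3, 57.6, 55.0,
    54.6, 53.6, 52.0, 50.0, 49.0, 48.9, 45.0, 42.0,
]

models_tested = len(scores)
best_score = max(scores)
average_score = round(mean(scores), 1)  # assumed: displayed to one decimal place

print(models_tested, best_score, average_score)  # prints: 24 80.0 63.4
```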