SWE-bench Verified

Category: coding

SWE-bench Verified tests whether an AI model can resolve real GitHub issues drawn from popular open-source Python repositories. "Verified" means that human annotators confirmed each task is solvable and has correct test cases.

Models Tested: 24
Best Score: 80.0
Average Score: 63.4
Scale Range: 0–100
Weight: 1.5x

How It Works

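The model receives a GitHub issue description and a snapshot of the repository from before the issue was fixed. It must generate a code patch that resolves the issue. A task counts as resolved only if, with the patch applied, the tests tied to the issue's fix pass and the repository's previously passing tests still pass. The reported score is the percentage of tasks resolved.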
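The pass/fail criterion can be sketched in a few lines. SWE-bench labels the tests that the reference fix turns from failing to passing as FAIL_TO_PASS, and the tests that must keep passing as PASS_TO_PASS. The sketch below assumes test outcomes have already been collected per task; `TaskResult`, `is_resolved`, and `resolve_rate` are hypothetical names for illustration, not the official evaluation harness API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after applying the model's patch.

    Hypothetical container: maps test name -> True if the test passed.
    """
    fail_to_pass: dict[str, bool]  # tests the fix is expected to make pass
    pass_to_pass: dict[str, bool]  # tests that passed before and must not break


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if every test in both groups passes."""
    # Assumption: a task with no FAIL_TO_PASS tests cannot count as resolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, on a 0-100 scale."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)
```

A patch that fixes the reported bug but breaks an unrelated, previously passing test is scored as a failure under this rule, which is why the second check matters.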

Why It Matters

SWE-bench is widely treated as a leading benchmark for real-world coding ability. Unlike HumanEval, which tests isolated functions, SWE-bench requires understanding large codebases, diagnosing bugs, and writing patches that work within the existing architecture.

Limitations

It covers only Python repositories. Setup and execution are computationally expensive. Some issues require domain-specific knowledge beyond pure coding ability.

Leaderboard — SWE-bench Verified

| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 Pro | OpenAI | 80.0 |
| 🥈 | GPT-5.2 | OpenAI | 78.0 |
| 🥉 | Claude Opus 4.6 | Anthropic | 78.0 |
| 4 | GPT-5 Pro | OpenAI | 76.5 |
| 5 | GPT-5 | OpenAI | 75.0 |
| 6 | Claude Opus 4.5 | Anthropic | 75.0 |
| 7 | o3 Pro | OpenAI | 73.0 |
| 8 | Claude Opus 4 | Anthropic | 72.5 |
| 9 | Claude Sonnet 4.6 | Anthropic | 72.0 |
| 10 | Gemini 3 Pro | Google | 70.0 |
| 11 | o3 | OpenAI | 69.1 |
| 12 | Claude Sonnet 4.5 | Anthropic | 68.0 |
| 13 | Gemini 2.5 Pro Preview 06-05 | Google | 63.8 |
| 14 | Claude 3.7 Sonnet | Anthropic | 62.3 |
| 15 | R1 0528 | DeepSeek | 57.6 |
| 16 | Gemini 3 Flash Preview | Google | 55.0 |
| 17 | GPT-4.1 | OpenAI | 54.6 |
| 18 | Claude Sonnet 4 | Anthropic | 53.6 |
| 19 | Claude Haiku 4.5 | Anthropic | 52.0 |
| 20 | DeepSeek V3 0324 | DeepSeek | 50.0 |
| 21 | Claude 3.5 Sonnet | Anthropic | 49.0 |
| 22 | o1 | OpenAI | 48.9 |
| 23 | Mistral Large | Mistral | 45.0 |
| 24 | GPT-4.1 Mini | OpenAI | 42.0 |
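
The summary figures reported for this benchmark (models tested, best score, average score) can be reproduced from the leaderboard scores. The following is a small sketch that assumes the average is a plain unweighted mean rounded to one decimal place; `SCORES` and `summarize` are illustrative names, not part of any published tooling.

```python
from statistics import mean

# Scores copied from the leaderboard, in rank order.
SCORES = [
    80.0, 78.0, 78.0, 76.5, 75.0, 75.0, 73.0, 72.5,
    72.0, 70.0, 69.1, 68.0, 63.8, 62.3, 57.6, 55.0,
    54.6, 53.6, 52.0, 50.0, 49.0, 48.9, 45.0, 42.0,
]


def summarize(scores: list[float]) -> dict[str, float]:
    """Return the count, maximum, and mean (rounded to 0.1) of the scores."""
    return {
        "models_tested": len(scores),
        "best_score": max(scores),
        "average_score": round(mean(scores), 1),
    }


print(summarize(SCORES))
# {'models_tested': 24, 'best_score': 80.0, 'average_score': 63.4}
```

The scores sum to 1520.9 across 24 models, giving a mean of about 63.37, which rounds to the listed average of 63.4.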