HumanEval
HumanEval measures code generation ability by asking models to complete Python functions given a docstring description. It consists of 164 hand-crafted programming problems.
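For illustration, a HumanEval-style problem pairs a function signature and docstring with hidden unit tests; the sketch below resembles the dataset's first task, with the completion written by hand for this example:

```python
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer
    to each other than the given threshold."""
    # --- a model-generated completion would go below this line ---
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# Unit tests the completion must pass:
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) is False
```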
View paper / source

- Models Tested: 15
- Best Score: 97.0
- Average Score: 91.7
- Scale Range: 0–100
- Weight: 1x
How It Works
The model receives a function signature and docstring, then must generate the function body. Each solution is tested against a suite of unit tests. The primary metric is pass@1 — the percentage of problems solved correctly on the first attempt.
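The HumanEval paper defines an unbiased estimator for pass@k: generate n samples per problem, count the c that pass all unit tests, and compute the probability that at least one of k randomly drawn samples is correct. A minimal sketch of that formula (pass@1 reduces to c/n):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated,
    c of them correct, budget of k attempts."""
    if n - c < k:
        # Fewer failures than the budget: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 samples correct -> pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))  # 0.3
```

Averaging this estimate over all 164 problems yields the leaderboard score.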
Why It Matters
Code generation is one of the most practical and measurable AI capabilities. HumanEval provides a standardised way to compare models on programming tasks that range from simple string manipulation to algorithmic problem-solving.
Limitations
HumanEval only tests Python, and its problems are simple compared with real software engineering. Models may have memorised solutions from training data, since the benchmark predates most of them. It does not test debugging, code review, or working with existing codebases.
Leaderboard — HumanEval
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 97.0 |
| 🥈 | o4 Mini | OpenAI | 96.0 |
| 🥉 | Claude Opus 4 | Anthropic | 95.0 |
| 4 | Grok 3 Beta | xAI | 93.8 |
| 5 | GPT-4.1 | OpenAI | 93.4 |
| 6 | Gemini 2.5 Pro Preview 06-05 | Google | 93.2 |
| 7 | Claude Sonnet 4 | Anthropic | 93.0 |
| 8 | R1 | DeepSeek | 92.5 |
| 9 | DeepSeek V3 0324 | DeepSeek | 91.0 |
| 10 | GPT-4o (2024-05-13) | OpenAI | 90.2 |
| 11 | DeepSeek V3 | DeepSeek | 89.5 |
| 12 | Gemini 2.5 Flash | Google | 88.5 |
| 13 | QwQ 32B | Alibaba | 88.0 |
| 14 | Llama 4 Maverick | Meta | 87.5 |
| 15 | Qwen2.5 72B Instruct | Alibaba | 86.6 |