HumanEval

Category: coding

HumanEval measures code generation ability by asking models to complete Python functions given a docstring description. It consists of 164 hand-crafted programming problems.
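A representative task resembles the sketch below (modelled on the first problem in the suite; the model sees only the signature and docstring, and the body shown is one possible completion):

```python
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer to each
    other than the given threshold."""
    # One possible model-generated completion:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Hidden unit tests then check the completion, e.g.:
assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.05) is False
```

A completion counts as solved only if it passes every unit test for that problem.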


Models Tested: 15
Best Score: 97.0
Average Score: 91.7
Scale Range: 0–100
Weight: 1x

How It Works

The model receives a function signature and docstring, then must generate the function body. Each solution is tested against a suite of unit tests. The primary metric is pass@1 — the percentage of problems solved correctly on the first attempt.
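The pass@k family of metrics (of which pass@1 is the simplest case) is typically computed with an unbiased estimator: generate n samples per problem, count the c that pass, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for a problem
    c: number of those samples that pass all unit tests
    k: number of samples the metric imagines drawing

    Returns the probability that at least one of k samples
    (drawn without replacement from the n generations) is correct,
    computed as 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer failures than draws: some draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the plain pass rate c / n.
print(pass_at_k(10, 5, 1))  # → 0.5
```

Benchmark scores are then the average of this estimate over all 164 problems.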

Why It Matters

Code generation is one of the most practical and measurable AI capabilities. HumanEval provides a standardised way to compare models on programming tasks that range from simple string manipulation to algorithmic problem-solving.

Limitations

HumanEval only tests Python. Its problems are relatively simple compared to real software engineering, and models may have memorised solutions from training data. It also does not test debugging, code review, or working with existing codebases.

Leaderboard — HumanEval

Rank  Model                          Provider   Score
1     o3                             OpenAI     97.0
2     o4 Mini                        OpenAI     96.0
3     Claude Opus 4                  Anthropic  95.0
4     Grok 3 Beta                    xAI        93.8
5     GPT-4.1                        OpenAI     93.4
6     Gemini 2.5 Pro Preview 06-05   Google     93.2
7     Claude Sonnet 4                Anthropic  93.0
8     R1                             DeepSeek   92.5
9     DeepSeek V3 0324               DeepSeek   91.0
10    GPT-4o (2024-05-13)            OpenAI     90.2
11    DeepSeek V3                    DeepSeek   89.5
12    Gemini 2.5 Flash               Google     88.5
13    QwQ 32B                        Alibaba    88.0
14    Llama 4 Maverick               Meta       87.5
15    Qwen2.5 72B Instruct           Alibaba    86.6