HumanEval
HumanEval measures code generation ability by asking models to complete Python functions given a docstring description. It consists of 164 hand-crafted programming problems.
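For illustration, a HumanEval-style problem pairs a function signature and docstring with hidden unit tests; the sketch below resembles the dataset's first task, with the completion written by hand for this example:

```python
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer
    to each other than the given threshold."""
    # --- a model-generated completion would go below this line ---
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# Unit tests the completion must pass:
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) is False
```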
View paper / source

- Models Tested: 15
- Best Score: 97.0
- Average Score: 91.7
- Scale Range: 0–100
- Weight: 1x
How It Works
The model receives a function signature and docstring, then must generate the function body. Each solution is tested against a suite of unit tests. The primary metric is pass@1 — the percentage of problems solved correctly on the first attempt.
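The HumanEval paper defines an unbiased estimator for pass@k: generate n samples per problem, count the c that pass all unit tests, and compute the probability that at least one of k randomly drawn samples is correct. A minimal sketch of that formula (pass@1 reduces to c/n):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated,
    c of them correct, budget of k attempts."""
    if n - c < k:
        # Fewer failures than the budget: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 samples correct -> pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))  # 0.3
```

Averaging this estimate over all 164 problems yields the leaderboard score.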
Why It Matters
Code generation is one of the most practical and measurable AI capabilities. HumanEval provides a standardised way to compare models on programming tasks that range from simple string manipulation to algorithmic problem-solving.
Limitations
HumanEval only tests Python, and its problems are simple compared with real software engineering. Models may have memorised solutions from training data, since the benchmark predates most of them. It does not test debugging, code review, or working with existing codebases.
Leaderboard — HumanEval
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 97.0 |
| 🥈 | o4 Mini | OpenAI | 96.0 |
| 🥉 | Claude Opus 4 | Anthropic | 95.0 |
| 4 | Grok 3 Beta | xAI | 93.8 |
| 5 | GPT-4.1 | OpenAI | 93.4 |
| 6 | Gemini 2.5 Pro Preview 06-05 | Google | 93.2 |
| 7 | Claude Sonnet 4 | Anthropic | 93.0 |
| 8 | R1 | DeepSeek | 92.5 |
| 9 | DeepSeek V3 0324 | DeepSeek | 91.0 |
| 10 | GPT-4o (2024-05-13) | OpenAI | 90.2 |
| 11 | DeepSeek V3 | DeepSeek | 89.5 |
| 12 | Gemini 2.5 Flash | Google | 88.5 |
| 13 | QwQ 32B | Alibaba | 88.0 |
| 14 | Llama 4 Maverick | Meta | 87.5 |
| 15 | Qwen2.5 72B Instruct | Alibaba | 86.6 |