MedQA
MedQA contains US Medical Licensing Exam (USMLE)-style questions: 11,450 multiple-choice problems testing medical knowledge and clinical reasoning. It is one of the most widely used benchmarks for evaluating medical knowledge in AI models.
Models Tested: 5
Best Score: 94.0
Average Score: 90.4
Scale Range: 0–100
Weight: 0.8x
How It Works
Models answer USMLE-style multiple-choice questions covering basic science, clinical medicine, and patient management. Questions require integrating knowledge across anatomy, pharmacology, pathology, and clinical practice.
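The scoring itself is simple multiple-choice grading. A minimal sketch, assuming a hypothetical `ask_model` callable that returns the model's chosen option letter (the actual harness and question schema are not specified on this page):

```python
# Minimal MCQA grading sketch: compare a model's letter choices against the
# answer key and report accuracy on a 0-100 scale. `ask_model` stands in for
# whatever interface the evaluated model exposes (an assumption, not MedQA's API).
def evaluate(questions, ask_model):
    correct = 0
    for q in questions:
        choice = ask_model(q["question"], q["options"])  # e.g. returns "C"
        if choice == q["answer"]:
            correct += 1
    return 100.0 * correct / len(questions)

# Toy run with a stub "model" that always answers "A":
sample = [
    {"question": "q1", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"question": "q2", "options": {"A": "x", "B": "y"}, "answer": "B"},
]
print(evaluate(sample, lambda q, opts: "A"))  # 50.0
```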
Why It Matters
Healthcare is one of the highest-stakes applications for AI. MedQA provides a rigorous, standardised way to evaluate whether AI models have the medical knowledge needed to assist healthcare professionals.
Limitations
The curriculum is US-centric. The multiple-choice format doesn't capture the complexity of real clinical reasoning. And high scores don't mean a model is safe for clinical use: the benchmark tests knowledge, not judgement.
Leaderboard — MedQA
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 94.0 |
| 🥈 | o3 | OpenAI | 92.0 |
| 🥉 | Gemini 2.5 Pro Preview 06-05 | Google | 91.0 |
| 4 | Claude Opus 4 | Anthropic | 89.0 |
| 5 | GPT-4o | OpenAI | 86.1 |