MedQA


MedQA is a benchmark of 11,450 multiple-choice questions in the style of the United States Medical Licensing Examination (USMLE), testing medical knowledge and clinical reasoning. It is one of the most widely used benchmarks for evaluating medical knowledge in AI models.

Models Tested: 5
Best Score: 94.0
Average Score: 90.4
Scale Range: 0–100
Weight: 0.8x
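The summary statistics above follow directly from the leaderboard scores. A minimal sketch, assuming the 0.8x weight scales this benchmark's contribution to an overall cross-benchmark index (the actual aggregation method is not specified here):

```python
# Scores taken from the leaderboard below; the weighting scheme is assumed.
scores = {
    "GPT-5.2": 94.0,
    "o3": 92.0,
    "Gemini 2.5 Pro Preview 06-05": 91.0,
    "Claude Opus 4": 89.0,
    "GPT-4o": 86.1,
}

best = max(scores.values())                    # 94.0
average = sum(scores.values()) / len(scores)   # 90.42, displayed as 90.4
weight = 0.8

# Hypothetical weighted contribution of each score to an aggregate index
weighted = {model: score * weight for model, score in scores.items()}

print(f"Models tested: {len(scores)}")
print(f"Best score: {best:.1f}")
print(f"Average score: {average:.1f}")
```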

How It Works

Models answer USMLE-style multiple-choice questions covering basic science, clinical medicine, and patient management. Questions require integrating knowledge across anatomy, pharmacology, pathology, and clinical practice.
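Scoring on a benchmark like this typically reduces to exact-match accuracy on the chosen option letter. A minimal sketch of that grading loop, with invented placeholder items rather than actual MedQA content:

```python
# Sketch of multiple-choice grading for USMLE-style items.
# Question stems, options, and answers below are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    options: dict          # option letter -> option text
    answer: str            # gold option letter, e.g. "C"

def grade(items, model_answers):
    """Return accuracy in percent: exact match on the chosen letter."""
    correct = sum(
        1 for item, pred in zip(items, model_answers)
        if pred.strip().upper() == item.answer
    )
    return 100.0 * correct / len(items)

items = [
    MCQItem("Placeholder stem 1", {"A": "...", "B": "...", "C": "..."}, "B"),
    MCQItem("Placeholder stem 2", {"A": "...", "B": "...", "C": "..."}, "A"),
]
print(grade(items, ["b", "C"]))  # 50.0: one of the two letters matches
```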

Why It Matters

Healthcare is one of the highest-stakes applications for AI. MedQA provides a rigorous, standardised way to evaluate whether AI models have the medical knowledge needed to assist healthcare professionals.

Limitations

The curriculum is US-centric, and the multiple-choice format does not capture the complexity of real clinical reasoning. High scores do not mean a model is safe for clinical use: the benchmark tests knowledge, not judgement.

Leaderboard — MedQA

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 🥇 | GPT-5.2 | OpenAI | 94.0 |
| 🥈 | o3 | OpenAI | 92.0 |
| 🥉 | Gemini 2.5 Pro Preview 06-05 | Google | 91.0 |
| 4 | Claude Opus 4 | Anthropic | 89.0 |
| 5 | GPT-4o | OpenAI | 86.1 |