MedQA
MedQA contains US Medical Licensing Exam (USMLE)-style questions: 11,450 multiple-choice problems testing medical knowledge and clinical reasoning. It is one of the most widely used benchmarks for evaluating medical knowledge in AI models.
Models Tested: 5
Best Score: 94.0
Average Score: 90.4
Scale Range: 0–100
Weight: 0.8x
How It Works
Models answer USMLE-style multiple-choice questions covering basic science, clinical medicine, and patient management. Questions require integrating knowledge across anatomy, pharmacology, pathology, and clinical practice.
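The scoring itself is simple multiple-choice grading. A minimal sketch, assuming a hypothetical `ask_model` callable that returns the model's chosen option letter (the actual harness and question schema are not specified on this page):

```python
# Minimal MCQA grading sketch: compare a model's letter choices against the
# answer key and report accuracy on a 0-100 scale. `ask_model` stands in for
# whatever interface the evaluated model exposes (an assumption, not MedQA's API).
def evaluate(questions, ask_model):
    correct = 0
    for q in questions:
        choice = ask_model(q["question"], q["options"])  # e.g. returns "C"
        if choice == q["answer"]:
            correct += 1
    return 100.0 * correct / len(questions)

# Toy run with a stub "model" that always answers "A":
sample = [
    {"question": "q1", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"question": "q2", "options": {"A": "x", "B": "y"}, "answer": "B"},
]
print(evaluate(sample, lambda q, opts: "A"))  # 50.0
```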
Why It Matters
Healthcare is one of the highest-stakes applications for AI. MedQA provides a rigorous, standardised way to evaluate whether AI models have the medical knowledge needed to assist healthcare professionals.
Limitations
The curriculum is US-centric. The multiple-choice format doesn't capture the complexity of real clinical reasoning. And high scores don't mean a model is safe for clinical use: the benchmark tests knowledge, not judgement.
Leaderboard — MedQA
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | GPT-5.2 | OpenAI | 94.0 |
| 🥈 | o3 | OpenAI | 92.0 |
| 🥉 | Gemini 2.5 Pro Preview 06-05 | Google | 91.0 |
| 4 | Claude Opus 4 | Anthropic | 89.0 |
| 5 | GPT-4o | OpenAI | 86.1 |