ARC Challenge

reasoning

ARC (AI2 Reasoning Challenge) tests grade-school-level science reasoning. The "Challenge" set contains the questions that retrieval-based and word co-occurrence methods fail to answer correctly.


Models Tested: 0

Average Score: 0.0

Scale Range: 0–100

Weight: 0.6x

How It Works

The benchmark consists of multiple-choice science questions drawn from standardised tests for grades 3 to 9. The Challenge set keeps only the questions that simple statistical methods and retrieval systems answer incorrectly.
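As a rough illustration of how a multiple-choice benchmark like this can be scored on a 0–100 scale, here is a minimal sketch. It assumes the score is plain accuracy expressed as a percentage, which this page does not state explicitly. The `Question` class, the `score` function, and both sample questions are hypothetical and are not taken from the ARC dataset or from any evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical container for one multiple-choice item."""
    stem: str
    choices: dict[str, str]  # option label -> option text
    answer_key: str          # label of the correct option


def score(questions: list[Question], predictions: list[str]) -> float:
    """Return the percentage of correctly answered questions (0-100).

    Assumption: the benchmark score is simple accuracy.
    """
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        return 0.0
    correct = sum(
        1 for q, p in zip(questions, predictions) if p == q.answer_key
    )
    return 100.0 * correct / len(questions)


# Made-up items in the style of grade-school science questions.
questions = [
    Question(
        stem="What causes day and night on Earth?",
        choices={
            "A": "Earth rotating on its axis",
            "B": "Earth orbiting the Sun",
            "C": "The Moon orbiting Earth",
            "D": "The Sun moving around Earth",
        },
        answer_key="A",
    ),
    Question(
        stem="Which material is the best conductor of electricity?",
        choices={"A": "Wood", "B": "Copper", "C": "Rubber", "D": "Glass"},
        answer_key="B",
    ),
]

# Labels a model might output, one per question (the second is wrong).
predictions = ["A", "C"]
result = score(questions, predictions)
```

With one of the two made-up questions answered correctly, `result` is 50.0. A real evaluation would also have to decide how to extract an option label from free-form model output, which this sketch leaves out.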

Why It Matters

ARC tests fundamental scientific reasoning, the kind of common-sense understanding that humans develop early. It helps show whether models can reason about cause and effect in the physical world.

Limitations

Most modern LLMs now score above 95%, so the benchmark is less useful for differentiating frontier models. The questions are also US-centric.

Leaderboard — ARC Challenge

No model scores recorded yet for this benchmark.