Aider Polyglot

coding

Aider Polyglot evaluates coding ability across 225 Exercism exercises in 6 languages: C++, Go, Java, JavaScript, Python, and Rust. Models get two attempts per problem; if the first attempt fails, the test error output is fed back before the second.

Models Tested: 7
Best Score: 82.0
Average Score: 75.4
Scale Range: 0–100
Weight: 1.1x

How It Works

Each model writes a solution to a programming exercise, and the solution is run against the exercise's test suite. If the first attempt fails, the model receives the error output and can try once more. The benchmark tracks both accuracy and cost per task.
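The two-attempt loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: `generate` and `run_tests` are hypothetical callables standing in for the model call and the test runner, and cost tracking is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ExerciseResult:
    passed: bool
    attempts_used: int


def evaluate_exercise(
    generate: Callable[[str, Optional[str]], str],   # hypothetical: (prompt, feedback) -> solution
    run_tests: Callable[[str], Tuple[bool, str]],    # hypothetical: solution -> (passed, test output)
    prompt: str,
    max_attempts: int = 2,
) -> ExerciseResult:
    """Give the model up to max_attempts tries, feeding back test output after a failure."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        solution = generate(prompt, feedback)
        passed, output = run_tests(solution)
        if passed:
            return ExerciseResult(passed=True, attempts_used=attempt)
        feedback = output  # the next attempt sees the failing test output
    return ExerciseResult(passed=False, attempts_used=max_attempts)


def pass_rate(results: List[ExerciseResult]) -> float:
    """Percentage of exercises solved within the allowed attempts (0 to 100)."""
    return 100.0 * sum(r.passed for r in results) / len(results)
```

Under this sketch, an exercise counts as solved if either attempt passes, so the score on the 0–100 scale is the percentage of exercises solved within two tries.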

Why It Matters

Real software engineering requires proficiency across multiple languages, not just Python. Aider Polyglot tests breadth of coding ability and the practical skill of debugging from test failures.

Limitations

Exercism problems are relatively self-contained, so they don't test working with large codebases. Only 6 languages are covered. The two-attempt format may not reflect real-world usage patterns.

Leaderboard — Aider Polyglot

# | Model | Provider | Score
🥇 | Claude Opus 4.6 | Anthropic | 82.0
🥈 | GPT-5.2 | OpenAI | 80.0
🥉 | Claude Sonnet 4.6 | Anthropic | 79.0
4 | o3 | OpenAI | 76.0
5 | Qwen2.5 Coder 32B Instruct | Alibaba | 73.7
6 | Gemini 2.5 Pro Preview 06-05 | Google | 72.0
7 | R1 | DeepSeek | 65.0
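
The summary figures near the top of the page (7 models tested, best score 82.0, average score 75.4) can be checked against the leaderboard rows. The short sketch below assumes the average is a plain unweighted mean rounded to one decimal place; the dictionary name `scores` is chosen here for illustration.

```python
# Leaderboard scores copied from the table above.
scores = {
    "Claude Opus 4.6": 82.0,
    "GPT-5.2": 80.0,
    "Claude Sonnet 4.6": 79.0,
    "o3": 76.0,
    "Qwen2.5 Coder 32B Instruct": 73.7,
    "Gemini 2.5 Pro Preview 06-05": 72.0,
    "R1": 65.0,
}

models_tested = len(scores)
best_model = max(scores, key=scores.get)
best_score = scores[best_model]
# Assumption: unweighted mean, rounded to one decimal place.
average_score = round(sum(scores.values()) / models_tested, 1)

print(models_tested, best_model, best_score, average_score)
```

The scores sum to 527.7, and 527.7 / 7 ≈ 75.39, which rounds to the 75.4 shown as the average.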