Aider Polyglot
coding
Aider Polyglot evaluates coding ability across 225 Exercism exercises in 6 languages: C++, Go, Java, JavaScript, Python, and Rust. Models get two attempts per problem with test error feedback.
7
Models Tested
82.0
Best Score
75.4
Average Score
0–100
Scale Range
1.1x
Weight
How It Works
Models solve programming exercises, and each solution is run against the exercise's test suite. If the first attempt fails, the model receives the error output and can try once more. The benchmark tracks both accuracy and cost per task.
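The two-attempt loop described above can be sketched roughly as follows. This is a minimal illustration and not Aider's actual harness: `evaluate`, `pass_rate`, `Result`, and the `solve` / `run_tests` callables are hypothetical names, and the model and the test runner are stand-ins supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, assumed for illustration only:
#   solve(problem, feedback) -> candidate source code
#   run_tests(code)          -> (passed, error_output)
Solve = Callable[[str, Optional[str]], str]
RunTests = Callable[[str], Tuple[bool, str]]


@dataclass
class Result:
    passed: bool
    attempts_used: int


def evaluate(problem: str, solve: Solve, run_tests: RunTests,
             max_attempts: int = 2) -> Result:
    """Give the model up to max_attempts tries, feeding test errors back in."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = solve(problem, feedback)
        passed, errors = run_tests(code)
        if passed:
            return Result(True, attempt)
        feedback = errors  # the next attempt sees the failing test output
    return Result(False, max_attempts)


def pass_rate(results: List[Result]) -> float:
    """Percentage of problems solved within the allowed attempts."""
    return 100.0 * sum(r.passed for r in results) / len(results)
```

A score on the 0–100 scale would then correspond to `pass_rate` over all exercises, under the assumption that each exercise counts equally.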
Why It Matters
Real software engineering requires proficiency across multiple languages, not just Python. Aider Polyglot tests breadth of coding ability and the practical skill of debugging from test failures.
Limitations
Exercism problems are small and self-contained, so they don't test working with large codebases. Only 6 languages are covered. The two-attempt format may not reflect real-world usage patterns.
Leaderboard — Aider Polyglot
| # | Model | Provider | Score |
|---|---|---|---|
| 🥇 | Claude Opus 4.6 | Anthropic | 82.0 |
| 🥈 | GPT-5.2 | OpenAI | 80.0 |
| 🥉 | Claude Sonnet 4.6 | Anthropic | 79.0 |
| 4 | o3 | OpenAI | 76.0 |
| 5 | Qwen2.5 Coder 32B Instruct | Alibaba | 73.7 |
| 6 | Gemini 2.5 Pro Preview 06-05 | Google | 72.0 |
| 7 | R1 | DeepSeek | 65.0 |
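
The summary figures at the top of the page (models tested, best score, average score) can be reproduced from the leaderboard rows. A small sketch, assuming the average is a plain unweighted mean of the seven listed scores; the page does not state how it is computed, and the variable names here are illustrative.

```python
# Scores copied from the leaderboard table above.
scores = {
    "Claude Opus 4.6": 82.0,
    "GPT-5.2": 80.0,
    "Claude Sonnet 4.6": 79.0,
    "o3": 76.0,
    "Qwen2.5 Coder 32B Instruct": 73.7,
    "Gemini 2.5 Pro Preview 06-05": 72.0,
    "R1": 65.0,
}

models_tested = len(scores)
best_score = max(scores.values())
# Assumption: unweighted mean, rounded to one decimal place.
average_score = round(sum(scores.values()) / models_tested, 1)

print(models_tested, best_score, average_score)  # prints: 7 82.0 75.4
```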