Bench-Maxing and Why You Should Test AI Models on Your Actual Work
Labs are increasingly accused of optimising models to score well on formal benchmarks rather than to be genuinely better. The fix is unglamorous: a small set of real tasks, run manually, tracked over time.
The Bench-Maxing Problem
Every frontier model launch now comes with a benchmark table. MMLU: 91.2%. GPQA Diamond: 75.4%. SWE-bench: 62%. The numbers go up every few months. But something is getting harder to explain: for a lot of practitioners, the experience of using these models on actual work is not improving at the same pace.
The term "bench-maxing" has entered the AI discourse to describe the practice of training or fine-tuning models specifically to perform well on the benchmarks used to evaluate them, rather than improving general capability. It is the AI equivalent of teaching to the test. A model can be specifically optimised for MMLU question formats, or trained on data that overlaps with HumanEval problems, and post impressive numbers without becoming more useful to the person asking it to debug a script or summarise a contract.
This is not always deliberate bad faith. But the incentive structure is clear: high benchmark scores generate press coverage and user sign-ups. The labs know which benchmarks matter to which audiences. And evaluation contamination — where benchmark questions end up in training data — is notoriously difficult to detect or prevent.
The contamination problem
When a benchmark is published, its questions become part of the public internet. Future models trained on web data will inevitably see them. Whether labs explicitly include benchmark datasets in training is a separate (and disputed) question — but even passive contamination is enough to inflate scores over time, making it impossible to compare a 2026 model's MMLU score against a 2024 model's on equal terms.
Why Community Signal Is More Honest
The most useful signal for which model is actually better at a given task often comes from practitioners in domain-specific communities — not from benchmark tables. On subreddits like r/dataengineering, r/excel, r/financialmodelling, or r/legaladvice, people share what actually worked and what failed — in the context of their real work, with real data, and with an outcome that mattered to them.
This signal has its own problems: it is anecdotal, it skews toward vocal users, and it reflects the demographics of those communities. But it has one property that formal benchmarks cannot replicate: the tasks were not designed to be evaluated. Nobody optimised for what r/dataengineering users would post about.
The same applies to Chatbot Arena — the LMSYS crowdsourced Elo leaderboard where real users vote on real conversations. It is imperfect and can be gamed through coordination, but the sheer volume (millions of votes) and the fact that users bring their own tasks make it much harder to bench-max against. It is the closest thing to a representative real-world signal that currently exists at scale.
The Case for a Personal Benchmark Kit
If no public benchmark reliably measures performance on your specific work, the logical response is to build a small one yourself. Not a scientific study — a personal regression test suite. The goal is not to produce a number you can cite; it is to answer the question: "Did this new model do better or worse on the tasks I actually care about?"
This is more tractable than it sounds. You only need around 15–25 tasks. The critical rule is that prompts must be fixed — never change a prompt once it is written, because then you lose historical comparability. A spreadsheet is sufficient infrastructure.
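The "never change a prompt" rule is easy to break by accident. One lightweight way to enforce it, sketched below in Python (the task IDs and prompts are illustrative, not from any real suite), is to record a hash of each prompt when it is written and check against it before every run:

```python
import hashlib

# Hypothetical task registry: IDs and prompts are placeholders.
# Recording a hash of each prompt when it is created makes accidental
# edits loud — a changed prompt no longer matches its recorded hash.
TASKS = [
    {"id": "F-01", "category": "Formula", "prompt": "Write an Excel formula that ..."},
    {"id": "D-01", "category": "Debug", "prompt": "This macro raises error 91 when ..."},
]

def fingerprint(task):
    """Stable short hash of a task's prompt, recorded once when written."""
    return hashlib.sha256(task["prompt"].encode("utf-8")).hexdigest()[:12]

def drifted(tasks, recorded):
    """Return IDs of tasks whose prompt no longer matches its recorded hash."""
    return [t["id"] for t in tasks if fingerprint(t) != recorded.get(t["id"])]

# Saved once, at creation time — e.g. in a column of the same spreadsheet.
recorded = {t["id"]: fingerprint(t) for t in TASKS}
```

Run `drifted(TASKS, recorded)` before each benchmark session; any ID it returns has lost historical comparability and needs a new task ID rather than a silent edit.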
Task design: the actual hard part
Writing good benchmark tasks is harder than it sounds. Two rules apply:
- Easy tasks are useless. If every model scores full marks, you learn nothing. Tasks need to be difficult enough that at least one model fails part of the rubric.
- Use your real failure cases. The best tasks come from moments when a model gave you a bad answer at work. Document those. They are a natural test battery because you already know what a good answer looks like.
Organise tasks into categories that reflect your actual usage. A plausible structure:
| Category | Tasks | Scoring approach |
|---|---|---|
| Code / formula generation | 5–8 | Gold standard — pre-write the correct answer, score correct / partial / wrong |
| Debugging / error diagnosis | 3–5 | Gold standard — did it identify the root cause? Did it fix it correctly? |
| Analysis and interpretation | 3–5 | Rubric scoring — pre-define 3–4 criteria before running any model |
| Your known edge cases | 3–5 | Either approach — these are the tasks where models typically trip up for you |
Three scoring approaches
1. Gold standard answers — for tasks with a definite correct answer (code that runs, a formula that produces the right result), write the answer before you test any model. Score is: correct / partially correct / wrong. Fast and unambiguous.
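A minimal sketch of that three-level scale in Python, assuming you pre-agree a list of key fragments that qualify an answer as partial credit (real code-generation tasks should additionally execute the output rather than string-match it):

```python
def score_gold(output: str, gold: str, partial_markers: list[str]) -> str:
    """Score an answer against a pre-written gold standard.

    'correct' on an exact (whitespace-trimmed) match, 'partial' when the
    output contains any pre-agreed key fragment, otherwise 'wrong'.
    The markers are illustrative; tasks with runnable output should be
    scored by actually running the generated code instead.
    """
    if output.strip() == gold.strip():
        return "correct"
    if any(marker in output for marker in partial_markers):
        return "partial"
    return "wrong"
```

For example, with gold answer `=SUM(A1:A10)` and marker `"SUM"`, a model that returns `=SUMPRODUCT(A1:A10)` scores partial rather than wrong.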
2. Rubric scoring — for tasks that require judgement (analysis quality, explanation clarity), pre-define 3–4 criteria before running any model. Writing rubrics before testing prevents unconscious bias toward whatever the first model said. Example:
Task: "Explain why this Power Query step is slow and propose a fix"
[ ] Correctly identifies the bottleneck (1 pt)
[ ] Proposes a valid optimisation (1 pt)
[ ] Explains the trade-off or limitation of the fix (1 pt)
[ ] Answer is understandable to a non-expert (1 pt)
Max: 4 pts
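The checklist above can be encoded once so that every model run is tallied the same way — a small sketch, with the criteria copied from the rubric:

```python
# The four-point rubric above, encoded as data so per-criterion results
# are recorded identically for every model and every run.
RUBRIC = [
    "Correctly identifies the bottleneck",
    "Proposes a valid optimisation",
    "Explains the trade-off or limitation of the fix",
    "Answer is understandable to a non-expert",
]

def rubric_score(checks: dict) -> tuple:
    """Return (points awarded, max points) for one model's answer,
    given a dict mapping criterion text to True/False."""
    return sum(bool(checks.get(c, False)) for c in RUBRIC), len(RUBRIC)

# A run that met the first three criteria scores 3 out of 4:
points, maximum = rubric_score({c: True for c in RUBRIC[:3]})
```

Storing per-criterion booleans rather than a single total also tells you *which* criterion a new model regressed on.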
3. LLM-as-judge — run Model A on a task, then paste its output into Model B along with the rubric and ask it to grade strictly. Use the model you are not testing as the judge: Claude grades GPT-4o outputs, and GPT-4o grades Claude outputs. This reduces personal bias and cuts the manual grading effort. It is imperfect, but noticeably more consistent than unaided intuition once you are scoring a dozen tasks.
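The cross-grading step is mostly prompt assembly. A sketch, where `call_model(name, prompt)` is a stand-in for whatever API client you actually use (no real client library is assumed):

```python
# Judge prompt template; the rubric items are filled in per task.
JUDGE_TEMPLATE = """You are grading an answer strictly against a rubric.

Rubric:
{rubric}

Answer to grade:
{answer}

For each rubric item, reply PASS or FAIL, then give a total score."""

def grade_with_judge(call_model, judge_name, answer, rubric_items):
    """Ask the *other* model to grade an answer against the rubric.

    `call_model(name, prompt)` is a hypothetical hook for your own API
    client; it should return the judge model's text response.
    """
    prompt = JUDGE_TEMPLATE.format(
        rubric="\n".join(f"- {item}" for item in rubric_items),
        answer=answer,
    )
    return call_model(judge_name, prompt)
```

The key design choice is that `judge_name` is always the model *not* under test, so the same bias never grades its own output.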
The tracking sheet
A spreadsheet with fixed columns per model version, run each time something significant is released:
| Task ID | Category | Max | GPT-4o Jan | Claude Jan | GPT-4o Feb | Claude Feb |
|---|---|---|---|---|---|---|
| F-01 | Formula | 4 | 3 | 4 | 3 | 4 |
| D-01 | Debug | 3 | 2 | 3 | 3 | 3 |
| Total | | | 31/50 | 38/50 | 35/50 | 41/50 |
The absolute numbers are not scientifically rigorous. What matters is the delta over time — regression is immediately visible.
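Computing that delta is a one-liner once the sheet is exported to CSV. A sketch, using the same illustrative scores as the table above (column names are assumptions about your export):

```python
import csv
import io

# Illustrative export of the tracking sheet; column names are assumed.
SHEET = """task_id,max,gpt4o_jan,claude_jan,gpt4o_feb,claude_feb
F-01,4,3,4,3,4
D-01,3,2,3,3,3
"""

def deltas(csv_text, before, after):
    """Per-task score change between two runs; negatives are regressions."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {row["task_id"]: int(row[after]) - int(row[before]) for row in rows}

# GPT-4o improved by one point on D-01 between January and February:
changes = deltas(SHEET, "gpt4o_jan", "gpt4o_feb")
# changes == {"F-01": 0, "D-01": 1}
```

Any negative value in the result is a regression worth investigating before switching your daily workflow to the new model.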
A Worked Example: Power Query, M Code, and VBA
To make this concrete: one domain where the gap between benchmark scores and real-world usefulness is particularly apparent is spreadsheet automation — specifically Power Query, M code, and VBA. These are narrow, practical skills used daily in financial analysis, operations, and data processing work. No published benchmark specifically tests them.
A small personal test suite for this domain might look like:
- M code generation (5 tasks) — "Write a Power Query step that unpivots these columns and handles nulls", "Merge these two queries on a fuzzy match". Gold standard scoring: does the code run? Does it produce the right output?
- VBA debugging (3 tasks) — paste a broken macro, ask for the fix. Gold standard: does the corrected code execute without error?
- Explanation quality (3 tasks) — "Explain what this M expression does and where it might fail". Rubric: correctness, completeness, clarity to a non-developer.
- Known edge cases (3 tasks) — the specific scenarios that previously caught a model out. These are your most valuable tasks.
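For the M code tasks, "does it produce the right output?" can be checked without re-reading the generated code: load whatever table the model's query produces and compare it row-for-row against a pre-written expected table. A sketch in Python (the rows below are made up for illustration; in practice you would export both tables to CSV and compare there):

```python
# Hypothetical gold-standard output for the unpivot task: three rows of
# (account, month, value). Written before any model is tested.
EXPECTED = {
    ("acme", "jan", 100),
    ("acme", "feb", None),   # the null-handling part of the task
    ("bex", "jan", 70),
}

def matches_gold(result_rows):
    """True when the produced rows equal the expected rows.

    Comparison is order-insensitive, which is the right check for an
    unpivot — row order is not part of the task's contract.
    """
    return set(result_rows) == EXPECTED
```

This keeps the scoring mechanical: whatever M code the model writes, it either reproduces the expected table or it does not.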
This is a narrow, small, very personal test. It will not tell you which model is best for legal drafting or medical summarisation. It will tell you, reliably, which model is currently best for this work. That is exactly the information a benchmark table cannot give you.
Community Signal as a Complement
Personal benchmarking captures your own experience. For a broader view of how models perform in a given domain, practitioner communities provide real signal that formal benchmarks cannot. People in domain-specific subreddits and forums share their experiences with AI tools in the context of real problems — not toy examples designed for evaluation.
The limitation is that this signal is qualitative and anecdotal. The value is that it is almost impossible to game. No lab has yet optimised a model specifically to score well in the spontaneous opinions of r/dataengineering members.
Together, personal testing and community observation are probably more predictive of real-world usefulness than the benchmark table in any model's press release.
Our position on benchmarks
We track formal benchmarks on this site because they are the best standardised data available and useful for directional comparisons — especially across model families and capability categories. But we weight community consensus (Chatbot Arena Elo, practitioner feedback) heavily in our quality scores precisely because it is harder to game. Where we know a benchmark has saturation or contamination problems, we flag it. Our methodology is documented in full on the Methodology page.