WildBench Creative
domain
Creative subset of WildBench: real user creative writing prompts judged by GPT-4
Models Tested: 8
Best Score: 88.0
Average Score: 81.0
Scale Range: 0–100
Weight: 0.8x
How It Works
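Models respond to real user creative writing prompts drawn from WildBench, and GPT-4 judges each response. Scores are reported on a 0–100 scale according to the benchmark's standardised protocol.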
Why It Matters
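Every model answers the same real user creative writing prompts and is scored by the same GPT-4 judge, so the results compare creative writing capability across models in a standardised way.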
Limitations
Scores come from an automated GPT-4 judge and cover only creative writing prompts, so they should be considered alongside other evaluations.
Leaderboard — WildBench Creative
| # | Model | Provider | Score | |
|---|---|---|---|---|
| 🥇 | Claude Opus 4.6 | Anthropic | 88.0 | |
| 🥈 | GPT-5.2 | OpenAI | 86.0 | |
| 🥉 | Claude Opus 4 | Anthropic | 84.0 | |
| 4 | Gemini 2.5 Pro Preview 06-05 | Google | 82.0 | |
| 5 | Claude Sonnet 4 | Anthropic | 82.0 | |
| 6 | Grok 4 | xAI | 80.0 | |
| 7 | GPT-4o | OpenAI | 78.0 | |
| 8 | R1 | DeepSeek | 68.0 | |
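
The summary figures (Models Tested, Best Score, Average Score) can be recomputed from the leaderboard rows. The sketch below is a minimal illustration, assuming the average is a plain unweighted mean rounded to one decimal place. The `summarize` helper and the `scores` dictionary are hypothetical names introduced here, not part of the benchmark's tooling.

```python
from statistics import mean

# Scores copied from the leaderboard table above.
scores = {
    "Claude Opus 4.6": 88.0,
    "GPT-5.2": 86.0,
    "Claude Opus 4": 84.0,
    "Gemini 2.5 Pro Preview 06-05": 82.0,
    "Claude Sonnet 4": 82.0,
    "Grok 4": 80.0,
    "GPT-4o": 78.0,
    "R1": 68.0,
}


def summarize(scores):
    """Return (models_tested, best_score, average_score).

    Assumption: the average is an unweighted mean, rounded to one decimal.
    """
    values = list(scores.values())
    return len(values), max(values), round(mean(values), 1)


models_tested, best_score, average_score = summarize(scores)
print(models_tested, best_score, average_score)  # 8 88.0 81.0
```

Under that assumption the output matches the published figures: 8 models, a best score of 88.0, and an average of 81.0. The 0.8x weight is not used here, since the page does not say how it is applied.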