WildBench Creative

domain

Creative subset of WildBench: real user creative writing prompts, with model responses judged by GPT-4.

Models Tested: 8
Best Score: 88.0
Average Score: 81.0
Scale Range: 0–100
Weight: 0.8x

How It Works

Each model answers the real user creative writing prompts in this WildBench subset, GPT-4 judges the responses, and the result is reported on a 0–100 scale.

Why It Matters

Because the prompts come from real users, the scores give a standardised way to compare how well AI models handle creative writing requests.

Limitations

The scores rely on a single automated judge (GPT-4) and cover creative writing only, so they should be considered alongside other evaluations.

Leaderboard — WildBench Creative

#   Model                         Provider   Score
🥇  Claude Opus 4.6               Anthropic  88.0
🥈  GPT-5.2                       OpenAI     86.0
🥉  Claude Opus 4                 Anthropic  84.0
4   Gemini 2.5 Pro Preview 06-05  Google     82.0
5   Claude Sonnet 4               Anthropic  82.0
6   Grok 4                        xAI        80.0
7   GPT-4o                        OpenAI     78.0
8   R1                            DeepSeek   68.0
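
The summary figures at the top of the page (models tested, best score, average score) can be checked against the leaderboard rows. The following is a minimal sketch of that arithmetic; the names `LEADERBOARD` and `summarize` are illustrative and not part of any WildBench tooling, and the 0.8x weight is left out because the page does not say how it is applied.

```python
from statistics import mean

# Rows copied from the leaderboard above: (model, provider, score).
LEADERBOARD = [
    ("Claude Opus 4.6", "Anthropic", 88.0),
    ("GPT-5.2", "OpenAI", 86.0),
    ("Claude Opus 4", "Anthropic", 84.0),
    ("Gemini 2.5 Pro Preview 06-05", "Google", 82.0),
    ("Claude Sonnet 4", "Anthropic", 82.0),
    ("Grok 4", "xAI", 80.0),
    ("GPT-4o", "OpenAI", 78.0),
    ("R1", "DeepSeek", 68.0),
]


def summarize(rows):
    """Return the count, best score, and mean score (one decimal place)."""
    scores = [score for _model, _provider, score in rows]
    return {
        "models_tested": len(scores),
        "best_score": max(scores),
        "average_score": round(mean(scores), 1),
    }


summary = summarize(LEADERBOARD)
print(summary)
# {'models_tested': 8, 'best_score': 88.0, 'average_score': 81.0}
```

The eight scores sum to 648.0, so the mean is 81.0, which matches the Average Score shown on the page.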