Multimodal AI Models

33 LLMs that accept more than text as input, such as images, audio, or video. Ranked first by capability breadth (number of input modalities), then by quality score.

Models by input capability:

  • Vision (image input): 33
  • Audio input: 5
  • Video input: 0
  • Text only: 21

Multimodal Models

#   Model                      Provider    Modalities           Quality
1   Gemini 2.5 Pro             Google      Text, Vision, Audio  83.0
2   Gemini 2.0 Flash           Google      Text, Vision, Audio  81.0
3   Gemini 2.5 Flash           Google      Text, Vision, Audio  78.0
4   Gemini 2.0 Flash Lite      Google      Text, Vision, Audio  75.0
5   Gemini 2.5 Flash Lite      Google      Text, Vision, Audio  74.0
6   GPT-5.2 Pro                OpenAI      Text, Vision         93.0
7   GPT-5 Pro                  OpenAI      Text, Vision         90.0
8   GPT-5.2                    OpenAI      Text, Vision         90.0
9   o4 Mini                    OpenAI      Text, Vision         90.0
10  Claude Opus 4.6            Anthropic   Text, Vision         89.0
11  o3                         OpenAI      Text, Vision         88.0
12  o3 Pro                     OpenAI      Text, Vision         88.0
13  GPT-5                      OpenAI      Text, Vision         87.0
14  Claude Opus 4.5            Anthropic   Text, Vision         86.0
15  Claude Sonnet 4.6          Anthropic   Text, Vision         86.0
16  Claude Opus 4              Anthropic   Text, Vision         84.0
17  o1                         OpenAI      Text, Vision         84.0
18  Claude Sonnet 4            Anthropic   Text, Vision         79.0
19  Claude Sonnet 4.5          Anthropic   Text, Vision         79.0
20  Llama 4 Scout OSS          Meta        Text, Vision         79.0
21  GPT-5 Nano                 OpenAI      Text, Vision         78.0
22  Nova Pro 1.0               Amazon      Text, Vision         78.0
23  GPT-4.1                    OpenAI      Text, Vision         77.0
24  Claude 3.5 Haiku           Anthropic   Text, Vision         76.0
25  Claude Haiku 4.5           Anthropic   Text, Vision         76.0
26  Llama 4 Maverick OSS       Meta        Text, Vision         76.0
27  GPT-4.1 Mini               OpenAI      Text, Vision         75.0
28  GPT-4.1 Nano               OpenAI      Text, Vision         75.0
29  GPT-4o (2024-05-13)        OpenAI      Text, Vision         75.0
30  GPT-4o-mini                OpenAI      Text, Vision         74.0
31  Sonar                      Perplexity  Text, Vision         74.0
32  Mistral Small 3.1 24B OSS  Mistral     Text, Vision         72.0
33  Nova Lite 1.0              Amazon      Text, Vision         72.0
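
The ordering in the table above (more input modalities first, then higher quality) can be sketched in a few lines of Python. This is only an illustration: the page does not publish its exact ranking rule, so the sort key here is an assumption inferred from the row order, and the names `ROWS` and `rank` are hypothetical. The rows are a small subset copied from the table.

```python
# Illustrative subset of rows from the table: (model, input modalities, quality).
ROWS = [
    ("GPT-5.2 Pro", ("text", "vision"), 93.0),
    ("Gemini 2.5 Flash Lite", ("text", "vision", "audio"), 74.0),
    ("Claude Opus 4.6", ("text", "vision"), 89.0),
    ("Gemini 2.5 Pro", ("text", "vision", "audio"), 83.0),
]


def rank(rows):
    """Sort by modality count (descending), then by quality (descending).

    Assumed rule, inferred from the table order; not confirmed by the page.
    """
    return sorted(rows, key=lambda row: (-len(row[1]), -row[2]))


ranked = rank(ROWS)
for position, (model, modalities, quality) in enumerate(ranked, start=1):
    print(position, model, ", ".join(modalities), quality)
```

Under this rule, a model with three input modalities ranks above a two-modality model even when its quality score is lower, which matches how the Gemini models sit above GPT-5.2 Pro. How the page breaks ties between equal scores is not stated; `sorted` simply keeps the input order in that case.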

Text-Only Models

These models process text input only. Ranked by quality score.

#   Model                           Provider   Quality
1   Qwen3 235B A22B OSS             Alibaba    87.0
2   Qwen3 Max OSS                   Alibaba    85.0
3   DeepSeek R1 OSS                 DeepSeek   85.0
4   o3 Mini                         OpenAI     84.0
5   R1 0528 OSS                     DeepSeek   83.0
6   Qwen2.5 Coder 32B Instruct OSS  Alibaba    82.0
7   Command A OSS                   Cohere     80.0
8   Command R+ (08-2024) OSS        Cohere     79.0
9   DeepSeek V3.2 OSS               DeepSeek   77.0
10  Llama 3.1 70B Instruct OSS      Meta       77.0
11  DeepSeek V3 OSS                 DeepSeek   76.0
12  Phi 4 OSS                       Microsoft  74.0
13  Reka Flash 3                    Reka       74.0
14  Command R (08-2024) OSS         Cohere     73.0
15  Mistral Large OSS               Mistral    73.0
16  Mistral Nemo OSS                Mistral    72.0
17  Llama 3.3 70B Instruct OSS      Meta       71.0
18  Qwen2.5 72B Instruct OSS        Alibaba    71.0
19  Llama 3.1 8B Instruct OSS       Meta       68.0
20  Nova Micro 1.0                  Amazon     68.0
21  Command R7B (12-2024) OSS       Cohere     65.0

What are multimodal AI models?

Multimodal models can process multiple types of input — not just text, but also images, audio, and video. This enables use cases like:

  • Vision: Analyse images, read charts, describe photos, OCR documents
  • Audio: Transcribe speech, understand tone, process music
  • Video: Summarise videos, extract key frames, answer questions about video content
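
To connect these use cases to the rankings, here is a minimal Python sketch that picks the highest-quality model whose inputs cover a required set of modalities. The `MODELS` dictionary and the `best_model` helper are hypothetical names introduced for illustration, and the three entries are copied from the tables on this page rather than fetched from any API.

```python
# Hypothetical lookup: model name -> (supported input modalities, quality score).
# Values are a small sample copied from the tables on this page.
MODELS = {
    "Gemini 2.5 Pro": ({"text", "vision", "audio"}, 83.0),
    "GPT-5.2 Pro": ({"text", "vision"}, 93.0),
    "Qwen3 235B A22B": ({"text"}, 87.0),
}


def best_model(required):
    """Return the highest-quality model supporting every modality in `required`.

    `required` is a set such as {"vision"}; returns None if no model qualifies.
    """
    candidates = [
        (quality, name)
        for name, (modalities, quality) in MODELS.items()
        if required <= modalities  # subset test: all required inputs supported
    ]
    return max(candidates)[1] if candidates else None


print(best_model({"vision"}))
print(best_model({"audio"}))
print(best_model({"video"}))
```

With this sample, a vision-only requirement selects the top-scoring vision model, an audio requirement narrows the choice to a Gemini model, and a video requirement returns `None` because no model listed here accepts video input.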