Multimodal AI Models

26 LLMs that can process more than just text, including images and audio. Ranked first by capability breadth (number of input modalities), then by quality score.

Vision (Image Input): 26
Audio Input: 5
Video Input: 0
Text Only: 16

Multimodal Models

#   Model                        Provider    Modalities           Quality
1   Gemini 2.5 Pro               Google      Text, Vision, Audio  83.0
2   Gemini 2.0 Flash             Google      Text, Vision, Audio  81.0
3   Gemini 2.5 Flash             Google      Text, Vision, Audio  78.0
4   Gemini 2.5 Flash Lite        Google      Text, Vision, Audio  78.0
5   Gemini 2.0 Flash Lite        Google      Text, Vision, Audio  76.0
6   GPT-5.2                      OpenAI      Text, Vision         90.0
7   O4 Mini                      OpenAI      Text, Vision         90.0
8   Claude Opus 4.6              Anthropic   Text, Vision         89.0
9   Grok 4                       xAI         Text, Vision         88.0
10  O3                           OpenAI      Text, Vision         88.0
11  O3 Pro                       OpenAI      Text, Vision         88.0
12  GPT-5                        OpenAI      Text, Vision         87.0
13  Claude Sonnet 4.6            Anthropic   Text, Vision         86.0
14  Claude Opus 4                Anthropic   Text, Vision         84.0
15  Claude 3.5 Haiku             Anthropic   Text, Vision         82.0
16  GPT-4o-mini                  OpenAI      Text, Vision         80.0
17  Claude Sonnet 4              Anthropic   Text, Vision         79.0
18  GPT-5 Nano                   OpenAI      Text, Vision         78.0
19  Nova Pro 1.0                 Amazon      Text, Vision         78.0
20  GPT-4.1                      OpenAI      Text, Vision         77.0
21  Mistral Small 3.1 24B (OSS)  Mistral     Text, Vision         76.0
22  GPT-4.1 Nano                 OpenAI      Text, Vision         75.0
23  GPT-4o (extended)            OpenAI      Text, Vision         75.0
24  Llama 4 Maverick (OSS)       Meta        Text, Vision         75.0
25  Sonar                        Perplexity  Text, Vision         74.0
26  Nova Lite 1.0                Amazon      Text, Vision         72.0
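
The ordering in the table above is consistent with sorting by the number of supported input modalities first and by quality score second. The sketch below illustrates that reading on four rows copied from the table. It is an assumption about how the ranking is produced, not a documented method, and the names Model, ROWS, and rank are hypothetical.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    modalities: frozenset  # input types the model accepts
    quality: float


# Four rows copied from the table above, deliberately out of order.
ROWS = [
    Model("GPT-5.2", frozenset({"text", "vision"}), 90.0),
    Model("Gemini 2.0 Flash", frozenset({"text", "vision", "audio"}), 81.0),
    Model("Claude Opus 4.6", frozenset({"text", "vision"}), 89.0),
    Model("Gemini 2.5 Pro", frozenset({"text", "vision", "audio"}), 83.0),
]


def rank(models):
    """Sort by modality count first, then by quality, both descending.

    Assumed ranking rule: it matches the table, but the source does not
    spell out the exact method.
    """
    return sorted(models, key=lambda m: (len(m.modalities), m.quality), reverse=True)


for position, model in enumerate(rank(ROWS), start=1):
    print(position, model.name, len(model.modalities), model.quality)
```

Under this rule a three-modality model with a score of 83.0 ranks above a two-modality model with a score of 90.0, which is what the table shows for Gemini 2.5 Pro and GPT-5.2.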

Text-Only Models

These models process text input only. Ranked by quality score.

#   Model                         Provider  Quality
1   Qwen3 235B A22B (OSS)         Alibaba   87.0
2   DeepSeek V3.2 (OSS)           DeepSeek  86.0
3   Mistral Large (OSS)           Mistral   86.0
4   Grok 3 Beta                   xAI       85.0
5   DeepSeek R1 (OSS)             DeepSeek  85.0
6   Command A (OSS)               Cohere    82.0
7   Command R+ (08-2024) (OSS)    Cohere    79.0
8   Llama 3.3 70B Instruct (OSS)  Meta      79.0
9   QwQ 32B (OSS)                 Alibaba   78.0
10  DeepSeek V3 (OSS)             DeepSeek  76.0
11  Reka Flash 3                  Reka      74.0
12  Command R (08-2024) (OSS)     Cohere    73.0
13  Mistral Nemo (OSS)            Mistral   72.0
14  Qwen2.5 72B Instruct (OSS)    Alibaba   71.0
15  Nova Micro 1.0                Amazon    68.0
16  Command R7B (12-2024) (OSS)   Cohere    65.0

What are multimodal AI models?

Multimodal models can process multiple types of input — not just text, but also images, audio, and video. This enables use cases like:

  • Vision: Analyse images, read charts, describe photos, OCR documents
  • Audio: Transcribe speech, understand tone, process music
  • Video: Summarise videos, extract key frames, answer questions about video content
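
One way to connect these use cases to the models listed on this page is to check which models accept the input type a use case needs. The sketch below does that for three models copied from the tables. The mapping from use case to input type and the names USE_CASE_INPUT, MODEL_INPUTS, and candidates are hypothetical. An "Audio" label says only that a model accepts audio input, not how well it handles a given audio task.

```python
# Hypothetical mapping from a use case to the input type it needs.
USE_CASE_INPUT = {
    "read charts": "vision",
    "transcribe speech": "audio",
    "summarise videos": "video",
}

# Input modalities for three models, copied from the tables on this page.
MODEL_INPUTS = {
    "Gemini 2.5 Pro": {"text", "vision", "audio"},
    "GPT-5.2": {"text", "vision"},
    "Qwen3 235B A22B": {"text"},
}


def candidates(use_case):
    """Return the sampled models whose inputs cover the given use case."""
    needed = USE_CASE_INPUT[use_case]
    return [name for name, inputs in MODEL_INPUTS.items() if needed in inputs]


for use_case in USE_CASE_INPUT:
    print(use_case, "->", candidates(use_case))
```

For this sample, the video use case returns an empty list, in line with the page listing no models with video input.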