Multimodal AI Models
33 LLMs that can process more than just text: vision, audio, video, and more. Ranked first by capability breadth (number of supported modalities), then by quality score.
- Vision (Image Input): 33
- Audio Input: 5
- Video Input: 0
- Text Only: 21
Multimodal Models
| # | Model | Provider | Modalities | Quality |
|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | Text, Vision, Audio | 83.0 |
| 2 | Gemini 2.0 Flash | Google | Text, Vision, Audio | 81.0 |
| 3 | Gemini 2.5 Flash | Google | Text, Vision, Audio | 78.0 |
| 4 | Gemini 2.0 Flash Lite | Google | Text, Vision, Audio | 75.0 |
| 5 | Gemini 2.5 Flash Lite | Google | Text, Vision, Audio | 74.0 |
| 6 | GPT-5.2 Pro | OpenAI | Text, Vision | 93.0 |
| 7 | GPT-5 Pro | OpenAI | Text, Vision | 90.0 |
| 8 | GPT-5.2 | OpenAI | Text, Vision | 90.0 |
| 9 | o4 Mini | OpenAI | Text, Vision | 90.0 |
| 10 | Claude Opus 4.6 | Anthropic | Text, Vision | 89.0 |
| 11 | o3 | OpenAI | Text, Vision | 88.0 |
| 12 | o3 Pro | OpenAI | Text, Vision | 88.0 |
| 13 | GPT-5 | OpenAI | Text, Vision | 87.0 |
| 14 | Claude Opus 4.5 | Anthropic | Text, Vision | 86.0 |
| 15 | Claude Sonnet 4.6 | Anthropic | Text, Vision | 86.0 |
| 16 | Claude Opus 4 | Anthropic | Text, Vision | 84.0 |
| 17 | o1 | OpenAI | Text, Vision | 84.0 |
| 18 | Claude Sonnet 4 | Anthropic | Text, Vision | 79.0 |
| 19 | Claude Sonnet 4.5 | Anthropic | Text, Vision | 79.0 |
| 20 | Llama 4 Scout (OSS) | Meta | Text, Vision | 79.0 |
| 21 | GPT-5 Nano | OpenAI | Text, Vision | 78.0 |
| 22 | Nova Pro 1.0 | Amazon | Text, Vision | 78.0 |
| 23 | GPT-4.1 | OpenAI | Text, Vision | 77.0 |
| 24 | Claude 3.5 Haiku | Anthropic | Text, Vision | 76.0 |
| 25 | Claude Haiku 4.5 | Anthropic | Text, Vision | 76.0 |
| 26 | Llama 4 Maverick (OSS) | Meta | Text, Vision | 76.0 |
| 27 | GPT-4.1 Mini | OpenAI | Text, Vision | 75.0 |
| 28 | GPT-4.1 Nano | OpenAI | Text, Vision | 75.0 |
| 29 | GPT-4o (2024-05-13) | OpenAI | Text, Vision | 75.0 |
| 30 | GPT-4o-mini | OpenAI | Text, Vision | 74.0 |
| 31 | Sonar | Perplexity | Text, Vision | 74.0 |
| 32 | Mistral Small 3.1 24B (OSS) | Mistral | Text, Vision | 72.0 |
| 33 | Nova Lite 1.0 | Amazon | Text, Vision | 72.0 |
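The ordering in the table above appears to put models with more modalities first, and then sort by quality within each group. The following is a minimal sketch of that ordering, not the site's actual ranking code. The `Model` class and `rank` function are hypothetical names, and the four rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    modalities: tuple[str, ...]
    quality: float


# A few rows copied from the table above, deliberately out of order.
rows = [
    Model("GPT-5.2 Pro", ("text", "vision"), 93.0),
    Model("Gemini 2.5 Flash", ("text", "vision", "audio"), 78.0),
    Model("Claude Opus 4.6", ("text", "vision"), 89.0),
    Model("Gemini 2.5 Pro", ("text", "vision", "audio"), 83.0),
]


def rank(models):
    """Assumed rule: more modalities first, then higher quality first."""
    return sorted(models, key=lambda m: (-len(m.modalities), -m.quality))


ranked = rank(rows)
for position, model in enumerate(ranked, start=1):
    print(position, model.name, len(model.modalities), model.quality)
```

Under this assumed rule, Gemini 2.5 Pro (83.0) is listed ahead of GPT-5.2 Pro (93.0) because it accepts one more input type, which matches the table. How the page breaks ties between equal scores is not stated, so the sketch leaves that to the stable sort.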
Text-Only Models
These models process text input only. Ranked by quality score.
| # | Model | Provider | Quality |
|---|---|---|---|
| 1 | Qwen3 235B A22B (OSS) | Alibaba | 87.0 |
| 2 | Qwen3 Max (OSS) | Alibaba | 85.0 |
| 3 | DeepSeek R1 (OSS) | DeepSeek | 85.0 |
| 4 | o3 Mini | OpenAI | 84.0 |
| 5 | R1 0528 (OSS) | DeepSeek | 83.0 |
| 6 | Qwen2.5 Coder 32B Instruct (OSS) | Alibaba | 82.0 |
| 7 | Command A (OSS) | Cohere | 80.0 |
| 8 | Command R+ (08-2024) (OSS) | Cohere | 79.0 |
| 9 | DeepSeek V3.2 (OSS) | DeepSeek | 77.0 |
| 10 | Llama 3.1 70B Instruct (OSS) | Meta | 77.0 |
| 11 | DeepSeek V3 (OSS) | DeepSeek | 76.0 |
| 12 | Phi 4 (OSS) | Microsoft | 74.0 |
| 13 | Reka Flash 3 | Reka | 74.0 |
| 14 | Command R (08-2024) (OSS) | Cohere | 73.0 |
| 15 | Mistral Large (OSS) | Mistral | 73.0 |
| 16 | Mistral Nemo (OSS) | Mistral | 72.0 |
| 17 | Llama 3.3 70B Instruct (OSS) | Meta | 71.0 |
| 18 | Qwen2.5 72B Instruct (OSS) | Alibaba | 71.0 |
| 19 | Llama 3.1 8B Instruct (OSS) | Meta | 68.0 |
| 20 | Nova Micro 1.0 | Amazon | 68.0 |
| 21 | Command R7B (12-2024) (OSS) | Cohere | 65.0 |
What are multimodal AI models?
Multimodal models can process multiple types of input — not just text, but also images, audio, and video. This enables use cases like:
- Vision: Analyse images, read charts, describe photos, OCR documents
- Audio: Transcribe speech, understand tone, process music
- Video: Summarise videos, extract key frames, answer questions about video content
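To connect these use cases to the rankings, one option is to filter the listed models by the input types a task needs and then sort the matches by quality. The sketch below is illustrative only. `MODELS` and `candidates` are hypothetical names, and the four entries are copied from the tables on this page.

```python
# Hypothetical capability map: model name -> (supported inputs, quality score).
# Entries are copied from the tables on this page.
MODELS = {
    "Gemini 2.5 Pro": ({"text", "vision", "audio"}, 83.0),
    "GPT-5.2 Pro": ({"text", "vision"}, 93.0),
    "Claude Opus 4.6": ({"text", "vision"}, 89.0),
    "Qwen3 235B A22B": ({"text"}, 87.0),
}


def candidates(required):
    """Return models that accept every required input type, best quality first."""
    required = set(required)
    matches = [
        (name, quality)
        for name, (inputs, quality) in MODELS.items()
        if required <= inputs  # subset test: all required inputs are supported
    ]
    return [name for name, _ in sorted(matches, key=lambda item: -item[1])]


print(candidates({"vision"}))  # e.g. chart reading or OCR
print(candidates({"audio"}))   # e.g. speech transcription
print(candidates({"video"}))   # no listed model accepts video input
```

With this sample, an audio task narrows the choice to Gemini 2.5 Pro, and a video task returns an empty list, consistent with the Video Input count of 0 on this page.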