Multimodal AI Models
26 LLMs that can process more than just text: vision, audio, video, and more. Ranked by capability breadth (number of supported input modalities) first, then by quality score.
- Vision (Image Input): 26
- Audio Input: 5
- Video Input: 0
- Text Only: 16
Multimodal Models
| # | Model | Provider | Modalities | Quality |
|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | Text, Vision, Audio | 83.0 |
| 2 | Gemini 2.0 Flash | Google | Text, Vision, Audio | 81.0 |
| 3 | Gemini 2.5 Flash | Google | Text, Vision, Audio | 78.0 |
| 4 | Gemini 2.5 Flash Lite | Google | Text, Vision, Audio | 78.0 |
| 5 | Gemini 2.0 Flash Lite | Google | Text, Vision, Audio | 76.0 |
| 6 | GPT-5.2 | OpenAI | Text, Vision | 90.0 |
| 7 | O4 Mini | OpenAI | Text, Vision | 90.0 |
| 8 | Claude Opus 4.6 | Anthropic | Text, Vision | 89.0 |
| 9 | Grok 4 | xAI | Text, Vision | 88.0 |
| 10 | O3 | OpenAI | Text, Vision | 88.0 |
| 11 | O3 Pro | OpenAI | Text, Vision | 88.0 |
| 12 | GPT-5 | OpenAI | Text, Vision | 87.0 |
| 13 | Claude Sonnet 4.6 | Anthropic | Text, Vision | 86.0 |
| 14 | Claude Opus 4 | Anthropic | Text, Vision | 84.0 |
| 15 | Claude 3.5 Haiku | Anthropic | Text, Vision | 82.0 |
| 16 | GPT-4o-mini | OpenAI | Text, Vision | 80.0 |
| 17 | Claude Sonnet 4 | Anthropic | Text, Vision | 79.0 |
| 18 | GPT-5 Nano | OpenAI | Text, Vision | 78.0 |
| 19 | Nova Pro 1.0 | Amazon | Text, Vision | 78.0 |
| 20 | GPT-4.1 | OpenAI | Text, Vision | 77.0 |
| 21 | Mistral Small 3.1 24B (OSS) | Mistral | Text, Vision | 76.0 |
| 22 | GPT-4.1 Nano | OpenAI | Text, Vision | 75.0 |
| 23 | GPT-4o (extended) | OpenAI | Text, Vision | 75.0 |
| 24 | Llama 4 Maverick (OSS) | Meta | Text, Vision | 75.0 |
| 25 | Sonar | Perplexity | Text, Vision | 74.0 |
| 26 | Nova Lite 1.0 | Amazon | Text, Vision | 72.0 |
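The ordering in the table above is consistent with a two-key sort: number of input modalities first, quality score second, both descending. The following is a minimal Python sketch of that rule, not the site's actual ranking code. The `Model` type and `rank` function are hypothetical names introduced for illustration, and only a few rows from the table are included.

```python
from typing import NamedTuple


class Model(NamedTuple):
    """Hypothetical record for one leaderboard row."""
    name: str
    modalities: frozenset
    quality: float


# A few rows copied from the table above, deliberately out of order.
models = [
    Model("GPT-5.2", frozenset({"text", "vision"}), 90.0),
    Model("Gemini 2.0 Flash", frozenset({"text", "vision", "audio"}), 81.0),
    Model("Claude Opus 4.6", frozenset({"text", "vision"}), 89.0),
    Model("Gemini 2.5 Pro", frozenset({"text", "vision", "audio"}), 83.0),
    Model("Nova Lite 1.0", frozenset({"text", "vision"}), 72.0),
]


def rank(rows):
    """Sort by modality count (descending), then by quality (descending).

    sorted() is stable, so rows that tie on both keys keep their input order.
    """
    return sorted(rows, key=lambda m: (-len(m.modalities), -m.quality))


ranked = rank(models)
for position, m in enumerate(ranked, start=1):
    print(position, m.name, ", ".join(sorted(m.modalities)), m.quality)
```

Under this assumed rule, a model with three modalities and a quality of 83.0 is listed ahead of a two-modality model scoring 90.0, which matches how the Gemini rows sit above GPT-5.2.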
Text-Only Models
These models process text input only. Ranked by quality score.
| # | Model | Provider | Quality |
|---|---|---|---|
| 1 | Qwen3 235B A22B (OSS) | Alibaba | 87.0 |
| 2 | DeepSeek V3.2 (OSS) | DeepSeek | 86.0 |
| 3 | Mistral Large (OSS) | Mistral | 86.0 |
| 4 | Grok 3 Beta | xAI | 85.0 |
| 5 | DeepSeek R1 (OSS) | DeepSeek | 85.0 |
| 6 | Command A (OSS) | Cohere | 82.0 |
| 7 | Command R+ (08-2024) (OSS) | Cohere | 79.0 |
| 8 | Llama 3.3 70B Instruct (OSS) | Meta | 79.0 |
| 9 | QwQ 32B (OSS) | Alibaba | 78.0 |
| 10 | DeepSeek V3 (OSS) | DeepSeek | 76.0 |
| 11 | Reka Flash 3 | Reka | 74.0 |
| 12 | Command R (08-2024) (OSS) | Cohere | 73.0 |
| 13 | Mistral Nemo (OSS) | Mistral | 72.0 |
| 14 | Qwen2.5 72B Instruct (OSS) | Alibaba | 71.0 |
| 15 | Nova Micro 1.0 | Amazon | 68.0 |
| 16 | Command R7B (12-2024) (OSS) | Cohere | 65.0 |
What are multimodal AI models?
Multimodal models can process multiple types of input — not just text, but also images, audio, and video. This enables use cases like:
- Vision: Analyse images, read charts, describe photos, OCR documents
- Audio: Transcribe speech, understand tone, process music
- Video: Summarise videos, extract key frames, answer questions about video content
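
To match a use case to a model, one simple approach is to filter by the input modalities the task needs. The sketch below assumes a hypothetical `catalog` mapping and `candidates` helper (neither comes from this page's tooling), populated with three entries from the tables on this page.

```python
# Hypothetical catalog: model name -> set of supported input modalities.
catalog = {
    "Gemini 2.5 Pro": {"text", "vision", "audio"},
    "GPT-5.2": {"text", "vision"},
    "Qwen3 235B A22B": {"text"},
}


def candidates(models, required):
    """Return the names of models that support every required modality."""
    needed = set(required)
    return sorted(name for name, mods in models.items() if needed <= mods)


print(candidates(catalog, {"audio"}))   # models usable for speech transcription
print(candidates(catalog, {"vision"}))  # models usable for chart reading or OCR
print(candidates(catalog, {"video"}))   # empty: no listed model accepts video
```

The subset test (`needed <= mods`) keeps the check correct when a task needs several modalities at once, for example both vision and audio.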