Multimodal
Learn what Multimodal (Multimodal AI) means in AI and machine learning, with examples and related concepts.
Definition
Multimodal AI refers to models that can understand and generate multiple types of data — text, images, audio, video, and code — rather than being limited to a single modality like text-only.
Early LLMs were text-in, text-out. You couldn’t show them an image or play them audio. Modern multimodal models like Claude, GPT-4o, and Gemini can process images alongside text, understand screenshots, read handwritten notes, analyze charts, and even work with audio and video.
This is a fundamental shift in how AI is used. Instead of describing a problem in words, you can show the model a screenshot of an error, a photo of a whiteboard sketch, or a chart from a dashboard. Multimodal models understand the world more like humans do — through multiple senses, not just language.
How It Works
```
UNIMODAL (text only):
Text → [LLM] → Text

MULTIMODAL (multiple modalities):
Text  ──┐
Image ──┤→ [Multimodal Model] → Text / Image / Audio
Audio ──┘
```
How models process different modalities:
```
Image input:
  Image → Vision Encoder (e.g. ViT) → image tokens → Transformer
  A photo of a cat becomes on the order of 1,000 tokens the model can attend to

Audio input:
  Audio → Audio Encoder (e.g. Whisper-style) → audio tokens → Transformer
  Speech becomes tokens, just like text
```
All modalities are converted to tokens in a shared embedding space, so the model can reason across them using the same attention mechanism.
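The vision-encoder step can be sketched in a few lines. This is an illustrative simplification of ViT-style patch tokenization, not a real encoder: a random matrix stands in for the learned projection, and real models add positional embeddings and many transformer layers on top.

```python
import numpy as np

def image_to_tokens(image, patch_size=16, embed_dim=768):
    """Split an image into patches and project each patch to an embedding.

    A 224x224 RGB image with 16x16 patches yields 14 * 14 = 196 patches,
    i.e. 196 "image tokens" the transformer can attend to alongside text.
    """
    h, w, _ = image.shape
    patches = []
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size, :]
            patches.append(patch.reshape(-1))  # flatten: 16 * 16 * 3 = 768 values
    patches = np.stack(patches)  # shape: (num_patches, 768)
    # Stand-in for a learned linear projection into the model's embedding space
    projection = np.random.randn(patches.shape[1], embed_dim) * 0.02
    return patches @ projection  # shape: (num_patches, embed_dim)

image = np.random.rand(224, 224, 3)
tokens = image_to_tokens(image)
print(tokens.shape)  # → (196, 768): 196 image tokens, same shape as text token embeddings
```

Once the image is in token form, attention over image tokens works exactly like attention over text tokens, which is what lets a single model answer questions about a picture.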
Input vs Output Modalities
| Model | Text In | Image In | Audio In | Text Out | Image Out | Audio Out |
|---|---|---|---|---|---|---|
| Claude | Yes | Yes | No | Yes | No | No |
| GPT-4o | Yes | Yes | Yes | Yes | Yes | Yes |
| Gemini | Yes | Yes | Yes | Yes | Yes | No |
| Llama 3.2 (vision) | Yes | Yes | No | Yes | No | No |
Why It Matters
- Richer interactions — “Fix this error” + screenshot is faster and more precise than describing the error in text
- Accessibility — AI can describe images for visually impaired users, transcribe audio for deaf users
- New use cases — Document understanding, visual Q&A, video analysis, and UI testing are all enabled by multimodal capabilities
- Real-world understanding — Multimodal models can interpret the visual world: read signs, understand diagrams, analyze medical images
Example
```python
# Claude vision — analyze an image
from anthropic import Anthropic
import base64

client = Anthropic()

# Read an image file and base64-encode it
with open("architecture_diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Explain this architecture diagram. What are the main components and how do they interact?",
            },
        ],
    }],
)
print(response.content[0].text)
# → Detailed explanation of the architecture shown in the image
```
```python
# Practical: analyze a chart/graph
# chart_image_data: a base64-encoded chart image, loaded the same way as above
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": chart_image_data,
                },
            },
            {
                "type": "text",
                "text": "Extract the data from this bar chart as a markdown table. Include all values.",
            },
        ],
    }],
)
# → Markdown table with extracted data from the chart
```
```python
# Multiple images in one request
# before_image and after_image: base64-encoded screenshots, loaded as above
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": before_image}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": after_image}},
            {"type": "text", "text": "Compare these two UI designs. What changed between v1 and v2?"},
        ],
    }],
)
# → Detailed comparison of the two designs
```
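The requests above repeat the same image content-block structure. A small helper can build that block from a file path. Note that `image_block` is a hypothetical convenience function written for this article, not part of the Anthropic SDK:

```python
import base64
import mimetypes

def image_block(path):
    """Build the base64 image content block used in the examples above.

    Hypothetical helper (not part of the SDK): guesses the media type
    from the file extension and base64-encodes the file contents.
    """
    media_type = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

With it, the multi-image request body becomes `[image_block("before.png"), image_block("after.png"), {"type": "text", "text": "Compare these two UI designs."}]`.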
Common Multimodal Use Cases
| Use Case | Input | What the Model Does |
|---|---|---|
| Bug reports | Screenshot + description | Identifies the error and suggests fixes |
| Document processing | Scanned PDF image | Extracts text, tables, and structure |
| Code from design | UI mockup image | Generates HTML/CSS matching the design |
| Data extraction | Chart/graph image | Converts visual data to structured format |
| Accessibility | Image | Generates detailed alt-text descriptions |
| Visual QA | Photo + question | Answers questions about image content |
Key Takeaways
- Multimodal AI processes multiple data types (text, images, audio, video) in a single model
- All modalities are converted to tokens, letting the model reason across them using the same attention mechanism
- Claude, GPT-4o, and Gemini all support image understanding; audio and video support varies
- Practical applications: analyzing screenshots, extracting data from charts, understanding documents, comparing designs
- Multimodal capabilities are expanding rapidly, and support for text, images, audio, and video is becoming standard across frontier models
Part of the DeepRaft Glossary — AI and ML terms explained for developers.