Multimodal
Learn what Multimodal (Multimodal AI) means in AI and machine learning, with examples and related concepts.
Definition
Multimodal AI refers to models that can understand and generate multiple types of data — text, images, audio, video, and code — rather than being limited to a single modality like text-only.
Early LLMs were text-in, text-out. You couldn’t show them an image or play them audio. Modern multimodal models like Claude, GPT-4o, and Gemini can process images alongside text, understand screenshots, read handwritten notes, analyze charts, and even work with audio and video.
This is a fundamental shift in how AI is used. Instead of describing a problem in words, you can show the model a screenshot of an error, a photo of a whiteboard sketch, or a chart from a dashboard. Multimodal models understand the world more like humans do — through multiple senses, not just language.
How It Works
```
UNIMODAL (text only):
Text → [LLM] → Text

MULTIMODAL (multiple modalities):
Text  ──┐
Image ──┤→ [Multimodal Model] → Text / Image / Audio
Audio ──┘
```
How models process different modalities:
```
Image input:
  Image → Vision Encoder (e.g. ViT) → image tokens → Transformer
  A photo of a cat becomes on the order of 1,000 tokens the model can attend to

Audio input:
  Audio → Audio Encoder (e.g. Whisper-style) → audio tokens → Transformer
  Speech becomes tokens, just like text
```
All modalities are converted to tokens in a shared embedding space, so the model can reason across them using the same attention mechanism.
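The vision-encoder step can be sketched in a few lines. This is an illustrative simplification of ViT-style patch tokenization, not a real encoder: a random matrix stands in for the learned projection, and real models add positional embeddings and many transformer layers on top.

```python
import numpy as np

def image_to_tokens(image, patch_size=16, embed_dim=768):
    """Split an image into patches and project each patch to an embedding.

    A 224x224 RGB image with 16x16 patches yields 14 * 14 = 196 patches,
    i.e. 196 "image tokens" the transformer can attend to alongside text.
    """
    h, w, _ = image.shape
    patches = []
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size, :]
            patches.append(patch.reshape(-1))  # flatten: 16 * 16 * 3 = 768 values
    patches = np.stack(patches)  # shape: (num_patches, 768)
    # Stand-in for a learned linear projection into the model's embedding space
    projection = np.random.randn(patches.shape[1], embed_dim) * 0.02
    return patches @ projection  # shape: (num_patches, embed_dim)

image = np.random.rand(224, 224, 3)
tokens = image_to_tokens(image)
print(tokens.shape)  # → (196, 768): 196 image tokens, same shape as text token embeddings
```

Once the image is in token form, attention over image tokens works exactly like attention over text tokens, which is what lets a single model answer questions about a picture.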
Input vs Output Modalities
| Model | Text In | Image In | Audio In | Text Out | Image Out | Audio Out |
|---|---|---|---|---|---|---|
| Claude | Yes | Yes | No | Yes | No | No |
| GPT-4o | Yes | Yes | Yes | Yes | Yes | Yes |
| Gemini | Yes | Yes | Yes | Yes | Yes | No |
| Llama 3.2 (vision) | Yes | Yes | No | Yes | No | No |
Why It Matters
- Richer interactions — “Fix this error” + screenshot is faster and more precise than describing the error in text
- Accessibility — AI can describe images for visually impaired users, transcribe audio for deaf users
- New use cases — Document understanding, visual Q&A, video analysis, and UI testing are all enabled by multimodal capabilities
- Real-world understanding — Multimodal models can interpret the visual world: read signs, understand diagrams, analyze medical images
Example
```python
# Claude vision — analyze an image
from anthropic import Anthropic
import base64

client = Anthropic()

# Read an image file and base64-encode it
with open("architecture_diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Explain this architecture diagram. What are the main components and how do they interact?",
            },
        ],
    }],
)
print(response.content[0].text)
# → Detailed explanation of the architecture shown in the image
```
```python
# Practical: analyze a chart/graph
# chart_image_data: a base64-encoded chart image, loaded the same way as above
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": chart_image_data,
                },
            },
            {
                "type": "text",
                "text": "Extract the data from this bar chart as a markdown table. Include all values.",
            },
        ],
    }],
)
# → Markdown table with extracted data from the chart
```
```python
# Multiple images in one request
# before_image and after_image: base64-encoded screenshots, loaded as above
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": before_image}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": after_image}},
            {"type": "text", "text": "Compare these two UI designs. What changed between v1 and v2?"},
        ],
    }],
)
# → Detailed comparison of the two designs
```
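The requests above repeat the same image content-block structure. A small helper can build that block from a file path. Note that `image_block` is a hypothetical convenience function written for this article, not part of the Anthropic SDK:

```python
import base64
import mimetypes

def image_block(path):
    """Build the base64 image content block used in the examples above.

    Hypothetical helper (not part of the SDK): guesses the media type
    from the file extension and base64-encodes the file contents.
    """
    media_type = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

With it, the multi-image request body becomes `[image_block("before.png"), image_block("after.png"), {"type": "text", "text": "Compare these two UI designs."}]`.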
Common Multimodal Use Cases
| Use Case | Input | What the Model Does |
|---|---|---|
| Bug reports | Screenshot + description | Identifies the error and suggests fixes |
| Document processing | Scanned PDF image | Extracts text, tables, and structure |
| Code from design | UI mockup image | Generates HTML/CSS matching the design |
| Data extraction | Chart/graph image | Converts visual data to structured format |
| Accessibility | Image | Generates detailed alt-text descriptions |
| Visual QA | Photo + question | Answers questions about image content |
Key Takeaways
- Multimodal AI processes multiple data types (text, images, audio, video) in a single model
- All modalities are converted to tokens, letting the model reason across them using the same attention mechanism
- Claude, GPT-4o, and Gemini all support image understanding; audio and video support varies
- Practical applications: analyzing screenshots, extracting data from charts, understanding documents, comparing designs
- Multimodal capabilities are expanding rapidly, and support for text, images, audio, and video is becoming standard across frontier models
Part of the DeepRaft Glossary — AI and ML terms explained for developers.