Attention Mechanism
Learn what Attention Mechanism means in AI and machine learning, with examples and related concepts.
Definition
Attention is the mechanism that allows an AI model to focus on the most relevant parts of its input when generating each piece of output. It’s the core innovation behind the Transformer architecture — and therefore behind every modern LLM.
Before attention, language models processed text sequentially and struggled with long-range dependencies. The sentence “The cat, which had been sleeping on the windowsill all morning, finally woke up and stretched its legs” is hard for a sequential model because “its” refers to “cat,” which appeared many words ago. Attention solves this by letting the model directly look at every other word when processing each word, regardless of distance.
The 2017 paper “Attention Is All You Need” (Vaswani et al.) showed that you could build a model using only attention — no recurrence, no convolutions. This architecture became the Transformer, which powers GPT, Claude, Gemini, and virtually every modern AI model.
How It Works
Attention computes a weighted relationship between every pair of tokens. For each token, the model asks: “How relevant is every other token to understanding this one?”
Input: "The bank by the river was flooded"
Processing the word "bank":
"The" → 0.05 (low relevance)
"bank" → 0.10
"by" → 0.08
"the" → 0.04
"river" → 0.52 ← HIGH (disambiguates "bank" = riverbank)
"was" → 0.06
"flooded" → 0.15 ← medium (reinforces the riverbank meaning)
The mathematical process:
- Each token is projected into three vectors: Query (Q), Key (K), and Value (V)
- Attention scores = dot product of Q with all K vectors (how relevant is each token?)
- Weights = softmax of scores (normalize to probabilities)
- Output = weighted sum of V vectors (blend the relevant information)
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
Where:
Q = "What am I looking for?" (the current token's query)
K = "What do I contain?" (each token's key)
V = "What information do I carry?" (each token's value)
d_k = dimension of keys (scaling factor for numerical stability)
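The projection step can be sketched in a few lines of PyTorch. This is a minimal illustration, not any particular model's code: the dimensions are arbitrary, and real Transformers add multiple heads, masking, and an output projection on top of this.

```python
import torch
import torch.nn as nn

d_model = 8  # embedding dimension (illustrative size, not from a real model)

# Learned linear maps that project each token embedding into Q, K, V
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 4, d_model)  # a batch of 4 token embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)

# The formula above, step by step
scores = Q @ K.transpose(-2, -1) / (d_model ** 0.5)  # relevance of every token pair
weights = torch.softmax(scores, dim=-1)              # each row sums to 1
output = weights @ V                                 # blend of value vectors

print(weights.shape)  # (1, 4, 4): one distribution over all tokens, per token
print(output.shape)   # (1, 4, 8): same shape as the input embeddings
```

Note that each row of `weights` is a probability distribution: the softmax guarantees it sums to 1, so the output is always a convex blend of the value vectors.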
Multi-Head Attention
In practice, models run multiple attention computations in parallel — called heads. Each head can learn to focus on different types of relationships:
- Head 1 might track grammatical subject-verb agreement
- Head 2 might track coreference (what “it” or “they” refers to)
- Head 3 might track semantic similarity
- Head 4 might track positional proximity
GPT-3, for example, uses 96 attention heads in each of its 96 layers.
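A minimal sketch of the multi-head idea, assuming the common split-and-reshape scheme: the embedding is divided evenly across heads, each head attends independently, and the results are concatenated back together. Head count and dimensions here are illustrative; real implementations also apply learned per-head projections and a final output projection.

```python
import torch

def multi_head_attention(Q, K, V, num_heads):
    """Split Q, K, V across heads, attend per head, then re-merge."""
    batch, seq_len, d_model = Q.shape
    d_head = d_model // num_heads

    # Reshape to (batch, heads, seq_len, d_head) so each head attends independently
    def split(x):
        return x.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)

    scores = Qh @ Kh.transpose(-2, -1) / (d_head ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    out = weights @ Vh  # (batch, heads, seq_len, d_head)

    # Concatenate the heads back into a single d_model-wide representation
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)

x = torch.randn(1, 4, 8)
y = multi_head_attention(x, x, x, num_heads=2)  # self-attention with 2 heads
print(y.shape)  # (1, 4, 8): same shape as the input
```

Because each head works in a smaller subspace (d_head = d_model / num_heads), the total compute is roughly the same as single-head attention, but the model gets several independent attention patterns per layer.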
Why It Matters
- Long-range understanding — Attention lets models connect information across thousands of tokens, enabling analysis of entire codebases or books
- Parallelization — Unlike sequential processing, attention computes all token relationships simultaneously on GPUs, making training dramatically faster
- Interpretability — Attention weights show what the model focused on, providing some insight into its reasoning
- Context windows — The quadratic cost of attention (every token attends to every other token) is why context windows have size limits and why expanding them is expensive
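The quadratic cost is easy to see by counting attention-matrix entries; this is pure arithmetic, no model required:

```python
# Each of n tokens attends to all n tokens, so one attention matrix has n * n scores.
for n in [1_000, 10_000, 100_000]:
    entries = n * n
    # At 4 bytes per float32 score, per head, per layer:
    megabytes = entries * 4 / 1e6
    print(f"{n:>7} tokens -> {entries:>15,} scores ({megabytes:,.0f} MB per head per layer)")
```

Going from 10,000 to 100,000 tokens multiplies the score count by 100, which is why expanding context windows is expensive and why techniques like sparse and sliding-window attention exist.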
Example
# Visualize attention patterns using a simple PyTorch implementation
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V):
    """Core attention computation used in every Transformer."""
    d_k = Q.size(-1)
    # Step 1: Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Step 2: Normalize to probabilities
    weights = F.softmax(scores, dim=-1)
    # Step 3: Weighted sum of values
    output = torch.matmul(weights, V)
    return output, weights
# Example: 4 tokens, embedding dimension 8
Q = torch.randn(1, 4, 8) # queries
K = torch.randn(1, 4, 8) # keys
V = torch.randn(1, 4, 8) # values
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row = what that token attends to):")
print(attention_weights.squeeze().detach())
# → 4x4 matrix: row i shows how much token i attends to each other token
Key Takeaways
- Attention lets each token directly “look at” every other token, solving the long-range dependency problem
- It works via Query-Key-Value projections: compute relevance scores, then blend information accordingly
- Multi-head attention runs multiple attention patterns in parallel, capturing different relationship types
- Attention is computationally quadratic in sequence length (O(n^2)), which is why context windows have limits
- Every modern LLM (GPT, Claude, Gemini, Llama) is built on attention — it’s the foundation of the AI revolution
Part of the DeepRaft Glossary — AI and ML terms explained for developers.