Attention Mechanism
Learn what Attention Mechanism means in AI and machine learning, with examples and related concepts.
Definition
Attention is the mechanism that allows an AI model to focus on the most relevant parts of its input when generating each piece of output. It’s the core innovation behind the Transformer architecture — and therefore behind every modern LLM.
Before attention, language models processed text sequentially and struggled with long-range dependencies. The sentence “The cat, which had been sleeping on the windowsill all morning, finally woke up and stretched its legs” is hard for a sequential model because “its” refers to “cat,” which appeared many words ago. Attention solves this by letting the model directly look at every other word when processing each word, regardless of distance.
The 2017 paper “Attention Is All You Need” (Vaswani et al.) showed that you could build a model using only attention — no recurrence, no convolutions. This architecture became the Transformer, which powers GPT, Claude, Gemini, and virtually every modern AI model.
How It Works
Attention computes a weighted relationship between every pair of tokens. For each token, the model asks: “How relevant is every other token to understanding this one?”
Input: "The bank by the river was flooded"
Processing the word "bank":
"The" → 0.05 (low relevance)
"bank" → 0.10
"by" → 0.08
"the" → 0.04
"river" → 0.52 ← HIGH (disambiguates "bank" = riverbank)
"was" → 0.06
"flooded" → 0.15 ← medium (reinforces the riverbank meaning)
The mathematical process:
- Each token is projected into three vectors: Query (Q), Key (K), and Value (V)
- Attention scores = dot product of Q with all K vectors (how relevant is each token?)
- Weights = softmax of scores (normalize to probabilities)
- Output = weighted sum of V vectors (blend the relevant information)
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
Where:
Q = "What am I looking for?" (the current token's query)
K = "What do I contain?" (each token's key)
V = "What information do I carry?" (each token's value)
d_k = dimension of keys (scaling factor for numerical stability)
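The projection step can be sketched in a few lines of PyTorch. This is a minimal illustration, not any particular model's code: the dimensions are arbitrary, and real Transformers add multiple heads, masking, and an output projection on top of this.

```python
import torch
import torch.nn as nn

d_model = 8  # embedding dimension (illustrative size, not from a real model)

# Learned linear maps that project each token embedding into Q, K, V
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 4, d_model)  # a batch of 4 token embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)

# The formula above, step by step
scores = Q @ K.transpose(-2, -1) / (d_model ** 0.5)  # relevance of every token pair
weights = torch.softmax(scores, dim=-1)              # each row sums to 1
output = weights @ V                                 # blend of value vectors

print(weights.shape)  # (1, 4, 4): one distribution over all tokens, per token
print(output.shape)   # (1, 4, 8): same shape as the input embeddings
```

Note that each row of `weights` is a probability distribution: the softmax guarantees it sums to 1, so the output is always a convex blend of the value vectors.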
Multi-Head Attention
In practice, models run multiple attention computations in parallel — called heads. Each head can learn to focus on different types of relationships:
- Head 1 might track grammatical subject-verb agreement
- Head 2 might track coreference (what “it” or “they” refers to)
- Head 3 might track semantic similarity
- Head 4 might track positional proximity
GPT-3, for example, uses 96 attention heads in each of its 96 layers.
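A minimal sketch of the multi-head idea, assuming the common split-and-reshape scheme: the embedding is divided evenly across heads, each head attends independently, and the results are concatenated back together. Head count and dimensions here are illustrative; real implementations also apply learned per-head projections and a final output projection.

```python
import torch

def multi_head_attention(Q, K, V, num_heads):
    """Split Q, K, V across heads, attend per head, then re-merge."""
    batch, seq_len, d_model = Q.shape
    d_head = d_model // num_heads

    # Reshape to (batch, heads, seq_len, d_head) so each head attends independently
    def split(x):
        return x.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)

    scores = Qh @ Kh.transpose(-2, -1) / (d_head ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    out = weights @ Vh  # (batch, heads, seq_len, d_head)

    # Concatenate the heads back into a single d_model-wide representation
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)

x = torch.randn(1, 4, 8)
y = multi_head_attention(x, x, x, num_heads=2)  # self-attention with 2 heads
print(y.shape)  # (1, 4, 8): same shape as the input
```

Because each head works in a smaller subspace (d_head = d_model / num_heads), the total compute is roughly the same as single-head attention, but the model gets several independent attention patterns per layer.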
Why It Matters
- Long-range understanding — Attention lets models connect information across thousands of tokens, enabling analysis of entire codebases or books
- Parallelization — Unlike sequential processing, attention computes all token relationships simultaneously on GPUs, making training dramatically faster
- Interpretability — Attention weights show what the model focused on, providing some insight into its reasoning
- Context windows — The quadratic cost of attention (every token attends to every other token) is why context windows have size limits and why expanding them is expensive
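The quadratic cost is easy to see by counting attention-matrix entries; this is pure arithmetic, no model required:

```python
# Each of n tokens attends to all n tokens, so one attention matrix has n * n scores.
for n in [1_000, 10_000, 100_000]:
    entries = n * n
    # At 4 bytes per float32 score, per head, per layer:
    megabytes = entries * 4 / 1e6
    print(f"{n:>7} tokens -> {entries:>15,} scores ({megabytes:,.0f} MB per head per layer)")
```

Going from 10,000 to 100,000 tokens multiplies the score count by 100, which is why expanding context windows is expensive and why techniques like sparse and sliding-window attention exist.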
Example
# Visualize attention patterns using a simple PyTorch implementation
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V):
    """Core attention computation used in every Transformer."""
    d_k = Q.size(-1)
    # Step 1: Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Step 2: Normalize to probabilities
    weights = F.softmax(scores, dim=-1)
    # Step 3: Weighted sum of values
    output = torch.matmul(weights, V)
    return output, weights
# Example: 4 tokens, embedding dimension 8
Q = torch.randn(1, 4, 8) # queries
K = torch.randn(1, 4, 8) # keys
V = torch.randn(1, 4, 8) # values
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row = what that token attends to):")
print(attention_weights.squeeze().detach())
# → 4x4 matrix: row i shows how much token i attends to each other token
Key Takeaways
- Attention lets each token directly “look at” every other token, solving the long-range dependency problem
- It works via Query-Key-Value projections: compute relevance scores, then blend information accordingly
- Multi-head attention runs multiple attention patterns in parallel, capturing different relationship types
- Attention is computationally quadratic in sequence length (O(n^2)), which is why context windows have limits
- Every modern LLM (GPT, Claude, Gemini, Llama) is built on attention — it’s the foundation of the AI revolution
Part of the DeepRaft Glossary — AI and ML terms explained for developers.