Attention Mechanism

Definition

Attention is the mechanism that allows an AI model to focus on the most relevant parts of its input when generating each piece of output. It’s the core innovation behind the Transformer architecture — and therefore behind every modern LLM.

Before attention, language models processed text sequentially and struggled with long-range dependencies. The sentence “The cat, which had been sleeping on the windowsill all morning, finally woke up and stretched its legs” is hard for a sequential model because “its” refers to “cat,” which appeared many words ago. Attention solves this by letting the model directly look at every other word when processing each word, regardless of distance.

The 2017 paper “Attention Is All You Need” (Vaswani et al.) showed that you could build a model using only attention — no recurrence, no convolutions. This architecture became the Transformer, which powers GPT, Claude, Gemini, and virtually every modern AI model.

How It Works

Attention computes a weighted relationship between every pair of tokens. For each token, the model asks: “How relevant is every other token to understanding this one?”

Input: "The bank by the river was flooded"

Processing the word "bank":
  "The"     → 0.05  (low relevance)
  "bank"    → 0.10
  "by"      → 0.08
  "the"     → 0.04
  "river"   → 0.52  ← HIGH (disambiguates "bank" = riverbank)
  "was"     → 0.06
  "flooded" → 0.15  ← medium (reinforces the riverbank meaning)

The mathematical process:

  1. Each token is projected into three vectors: Query (Q), Key (K), and Value (V)
  2. Attention scores = dot product of Q with all K vectors (how relevant is each token?)
  3. Weights = softmax of scores (normalize to probabilities)
  4. Output = weighted sum of V vectors (blend the relevant information)

This is expressed compactly as:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

Where:
  Q = "What am I looking for?" (the current token's query)
  K = "What do I contain?"     (each token's key)
  V = "What information do I carry?" (each token's value)
  d_k = dimension of keys (scaling factor for numerical stability)
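The example later in this entry starts from random Q, K, and V tensors, so step 1 is easy to miss: in a real Transformer, Q, K, and V are produced from the same token embeddings by learned linear projections. A minimal sketch of that projection step in PyTorch (layer names here are illustrative, not from any particular library):

```python
import torch
import torch.nn as nn

d_model = 8  # embedding dimension

# Three learned projections applied to the same input embeddings
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 4, d_model)  # 1 sequence of 4 token embeddings

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # each: torch.Size([1, 4, 8])
```

The weights of W_q, W_k, and W_v are what the model actually learns during training; the attention formula itself has no trainable parameters.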

Multi-Head Attention

In practice, models run multiple attention computations in parallel, called heads. Each head can learn to focus on a different type of relationship: one head might track syntactic structure, another coreference (linking "its" back to "cat"), another positional patterns.

For a sense of scale, GPT-3 uses 96 attention heads per layer across 96 layers.
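The head-splitting mechanics can be sketched in a few lines: the embedding is divided into equal slices, each slice attends independently, and the results are concatenated back together. This is a simplified illustration, omitting the learned per-head projections a real implementation would include:

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, num_heads):
    """Split the embedding into heads, attend per head, recombine.
    Simplified: real models project Q, K, V with learned weights first."""
    batch, seq, d_model = x.shape
    d_head = d_model // num_heads

    # Reshape to (batch, heads, seq, d_head) so each head attends independently
    def split(t):
        return t.view(batch, seq, num_heads, d_head).transpose(1, 2)

    Q, K, V = split(x), split(x), split(x)  # projections omitted for brevity
    scores = Q @ K.transpose(-2, -1) / (d_head ** 0.5)
    weights = F.softmax(scores, dim=-1)
    out = weights @ V

    # Concatenate heads back into one d_model-wide representation per token
    return out.transpose(1, 2).reshape(batch, seq, d_model)

x = torch.randn(1, 6, 16)
print(multi_head_attention(x, num_heads=4).shape)  # torch.Size([1, 6, 16])
```

Because each head works on a d_model/num_heads slice, multi-head attention costs roughly the same as a single full-width head while letting the model learn several relationship types at once.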

Why It Matters

Attention is why Transformers scale. Because every token can attend to every other token in a single step, the architecture handles long-range dependencies that defeated sequential models, and the computation parallelizes cleanly across modern hardware. Without attention there is no Transformer, and without the Transformer there is no GPT, Claude, or Gemini.
Example

# Compute and inspect attention weights with a minimal PyTorch implementation
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Core attention computation used in every Transformer."""
    d_k = Q.size(-1)

    # Step 1: Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Step 2: Normalize to probabilities
    weights = F.softmax(scores, dim=-1)

    # Step 3: Weighted sum of values
    output = torch.matmul(weights, V)

    return output, weights

# Example: 4 tokens, embedding dimension 8
Q = torch.randn(1, 4, 8)  # queries
K = torch.randn(1, 4, 8)  # keys
V = torch.randn(1, 4, 8)  # values

output, attention_weights = scaled_dot_product_attention(Q, K, V)

print("Attention weights (each row = what that token attends to):")
print(attention_weights.squeeze().detach())
# → 4x4 matrix: row i shows how much token i attends to each other token

Key Takeaways

  - Attention lets a model weigh every token against every other token directly, solving the long-range dependency problem that limited sequential models.
  - Each token is projected into Query, Key, and Value vectors; softmax-normalized Q·K scores determine a weighted sum over the Values.
  - Multi-head attention runs several of these computations in parallel, so different heads can specialize in different relationships.
  - The Transformer, built entirely on attention, underlies GPT, Claude, Gemini, and virtually every modern LLM.

Part of the DeepRaft Glossary — AI and ML terms explained for developers.