
Transformer


Definition

The Transformer is the neural network architecture behind every major large language model (LLM) today, including GPT-4, Claude, Gemini, and Llama.

Introduced in the 2017 paper “Attention Is All You Need” by Google researchers, the Transformer replaced older recurrent architectures (RNNs, LSTMs): instead of processing tokens one at a time, it processes all tokens in parallel using an attention mechanism. This parallelism made it practical to train on massive datasets, setting off the LLM revolution.

How It Works

A Transformer processes text through stacked layers, each containing two components:

Input Tokens → [Embedding Layer]
                      │
                      ▼
              ┌──────────────────┐
              │  Self-Attention  │  ← "which tokens should I focus on?"
              └──────────────────┘
                      │
                      ▼
              ┌──────────────────┐
              │  Feed-Forward    │  ← "process the attended information"
              └──────────────────┘
                      │
                      ▼   (repeat 96+ times)

              [Output Probabilities]
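The layer-stacking flow in the diagram can be sketched in plain NumPy. This is a toy illustration, not a real implementation: the `attend` function here just averages over all tokens as a stand-in for real self-attention, and the residual additions (standard in the original paper, though not shown in the diagram) keep the shape stable from layer to layer. All sizes and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                          # 5 tokens, embedding dimension 8 (toy sizes)

def attend(x):
    # stand-in for self-attention: every token simply averages over all
    # tokens; real self-attention uses learned, per-token weights
    return np.tile(x.mean(axis=0), (x.shape[0], 1))

def feed_forward(x, W1, W2):
    # "process the attended information" -- a small two-layer ReLU MLP
    return np.maximum(0.0, x @ W1) @ W2

x = rng.normal(size=(n, d))          # output of the embedding layer
for _ in range(4):                   # real models repeat this 96+ times
    W1 = rng.normal(size=(d, 4 * d)) * 0.1
    W2 = rng.normal(size=(4 * d, d)) * 0.1
    x = x + attend(x)                # self-attention sub-layer (residual add)
    x = x + feed_forward(x, W1, W2)  # feed-forward sub-layer (residual add)

print(x.shape)                       # still (5, 8): each layer preserves the shape
```

Because every layer maps an (n, d) array to an (n, d) array, layers can be stacked as deep as compute allows.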

Self-attention is the key innovation: for each token, the model computes how much “attention” to pay to every other token in the sequence. In “The cat sat on the mat because it was tired,” attention helps the model work out that “it” refers to “cat,” not “mat.”
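The computation behind this is scaled dot-product attention, the formula from the 2017 paper. A minimal NumPy sketch, with random embeddings and projection matrices standing in for learned ones:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # each token is projected into a query, a key, and a value vector
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = k.shape[-1]
    # weights[i, j] = how much attention token i pays to token j
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 10, 16                     # e.g. the 10 tokens of the example sentence
x = rng.normal(size=(n, d))       # token embeddings (random here, learned in practice)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(x, Wq, Wk, Wv)

# each row of `weights` is a probability distribution over all tokens
print(weights.shape, weights.sum(axis=1).round(3))
```

In a trained model, the row of `weights` for the token “it” would put most of its mass on “cat”; with random weights, as here, the distribution is just arbitrary.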

Why It Matters

Because attention looks at all tokens at once, Transformers train efficiently on modern parallel hardware and scale to massive datasets; that scalability is what made today's LLMs possible. The same core architecture also adapts to generation, understanding, and translation, depending on how it is configured.

Architecture Variants

| Variant         | Used In            | Key Feature                       |
|-----------------|--------------------|-----------------------------------|
| Decoder-only    | GPT, Claude, Llama | Text generation (autoregressive)  |
| Encoder-only    | BERT, RoBERTa      | Text understanding/classification |
| Encoder-decoder | T5, BART           | Translation, summarization        |
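What makes decoder-only models autoregressive is a causal mask: attention scores for future positions are set to negative infinity before the softmax, so each token can attend only to itself and earlier tokens. A toy sketch (the scores are all zeros here purely to show the masking effect):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n = 5
scores = np.zeros((n, n))                          # pretend attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
scores[mask] = -np.inf                             # block attention to future tokens

weights = softmax(scores)
print(np.round(weights, 2))
# row 0 attends only to token 0; row 4 spreads attention over all 5 tokens
```

Encoder-only models skip this mask, letting every token see the full sequence in both directions, which suits classification more than generation.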

Key Takeaways

- Introduced in 2017 by Google researchers in “Attention Is All You Need.”
- Replaced recurrent architectures (RNNs, LSTMs) by processing all tokens in parallel, which made training on massive datasets practical.
- Self-attention computes, for each token, how much to focus on every other token in the sequence.
- Comes in three main variants: decoder-only (generation), encoder-only (understanding), and encoder-decoder (translation, summarization).

Part of the DeepRaft Glossary — AI and ML terms explained for developers.