
How Transformer Architecture Works: A Deep Dive


Abstract Algorithms · 6 min read

TLDR: The Transformer is the architecture behind every major LLM (GPT, BERT, Claude, Gemini). Its core innovation is Self-Attention: a mechanism that lets the model weigh relationships between all tokens in a sequence simultaneously, regardless of distance. This enables parallelism that RNNs could not achieve.


📖 The Cocktail Party Listener: Attention as a Mental Model

Before Transformers, RNNs read text left-to-right, one word at a time, like reading a book. By the time the model reached the end of a long sentence, earlier words had faded from its hidden state.

Transformers are different. They stand in the center of a cocktail party and listen to everyone simultaneously:

  • 80% attention on the friend telling a story (relevant context).
  • 10% on the background music (modifier words).
  • 10% on the waiter (punctuation, structure).

The model builds this attention map over the entire sentence at once, with no sequential dependency. This is why Transformers train faster on modern GPUs: every token can be processed in parallel.


🔢 From Tokens to Embeddings: Preparing the Input

Before any attention computation, input text is converted to tensors:

  1. Tokenization: Split text into subword tokens.
    "unhappiness" → ["un", "##happ", "##iness"] (WordPiece / BPE)

  2. Token Embeddings: Map each token ID to a learned dense vector (dimension = 768 for BERT-base, 12,288 for GPT-3).

  3. Positional Encodings: Attention has no inherent sense of order. A positional encoding vector is added to each token embedding so the model knows position 0, position 1, etc.

Final Input = Token Embedding + Positional Encoding
| Component | Dimension | Learnable? |
| --- | --- | --- |
| Token embedding | 768 (BERT-base) | ✅ |
| Positional encoding (sinusoidal) | 768 | ❌ (fixed) |
| Positional encoding (learned) | 768 | ✅ (GPT-2+) |
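The input pipeline above can be sketched in a few lines of numpy. The embedding table below is random purely for illustration (in a real model it is learned), and the token IDs are made up:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings from "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even feature indices
    pe[:, 1::2] = np.cos(angles)                  # odd feature indices
    return pe

# Toy vocabulary; a real embedding table is a learned parameter.
vocab_size, d_model, seq_len = 100, 768, 3
embedding_table = np.random.randn(vocab_size, d_model) * 0.02
token_ids = np.array([17, 42, 7])                 # hypothetical IDs for 3 subwords

# Final input = token embedding + positional encoding.
x = embedding_table[token_ids] + sinusoidal_positional_encoding(seq_len, d_model)
print(x.shape)  # (3, 768)
```

Note that at position 0 the sinusoidal encoding is exactly [0, 1, 0, 1, ...], since sin(0) = 0 and cos(0) = 1.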

โš™๏ธ Self-Attention: How Every Token Reads the Room

The Q, K, V Framework

For each token, the model creates three vectors via learned linear projections:

  • Q (Query): "What am I looking for?"
  • K (Key): "What do I advertise as my content?"
  • V (Value): "What do I contribute if selected?"

The attention score between token $i$ and token $j$ is:

$$\text{score}(i, j) = \frac{Q_i \cdot K_j^T}{\sqrt{d_k}}$$

Divide by $\sqrt{d_k}$ (square root of key dimension) to prevent dot products from growing so large that softmax saturates.

Softmax normalizes scores into weights that sum to 1:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
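The full formula maps directly onto a short numpy sketch; the matrices here are random stand-ins for the learned projections:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise scores, scaled by sqrt(d_k)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

n, d_k = 4, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d_k))  # three random (n, d_k) matrices
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)                 # (4, 64)
print(weights.sum(axis=-1))      # every row sums to 1
```

Each output row is a weighted average of the value vectors, with weights given by how well that token's query matches every key.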

Concrete Example

Sentence: "The animal didn't cross the street because it was too tired."

When the model processes the token "it", self-attention computes:

  • High score to "animal" → "it" refers to the animal.
  • Low score to "street" → not the referent.

Without attention, a model might incorrectly resolve the pronoun based on proximity.

```mermaid
flowchart LR
    subgraph Self-Attention for "it"
        it["Token: it"] -->|high score| animal["Token: animal"]
        it -->|low score| street["Token: street"]
    end
```

🧠 Multi-Head Attention: Learning Parallel Relationship Types

Running attention once captures one type of relationship. Multi-Head Attention runs $h$ parallel attention mechanisms (heads) and concatenates their outputs.

| Head | What it tends to learn |
| --- | --- |
| Head 1 | Syntactic relations (subject → verb) |
| Head 2 | Coreference resolution (pronoun → noun) |
| Head 3 | Semantic similarity (synonyms) |
| Head 4+ | Long-range dependencies, modifier attachment, etc. |

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

$$\text{where head}_i = \text{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V)$$

GPT-3: 96 heads × 128 dimensions each = 12,288-dim model.
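In practice the split into heads is a reshape of one big projection, not h separate attention calls. A minimal numpy sketch (random weights, tiny dimensions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    n, d_model = x.shape
    d_k = d_model // h
    # Project once, then split the feature dimension into h heads.
    Q = (x @ W_q).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    K = (x @ W_k).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (x @ W_v).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n)
    heads = softmax(scores) @ V                          # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o                                  # W^O mixes heads together

n, d_model, h = 5, 64, 8
rng = np.random.default_rng(1)
x = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = rng.standard_normal((4, d_model, d_model)) * 0.1
out = multi_head_attention(x, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (5, 64)
```

Because each head works in its own d_model / h subspace, the total compute is roughly the same as a single full-width attention.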


๐Ÿ—๏ธ The Full Encoder-Decoder Architecture

The original "Attention Is All You Need" (Vaswani et al., 2017) paper used an Encoder-Decoder structure:

```mermaid
flowchart TD
    Input["Input Tokens\n(source language)"] --> Encoder["Encoder Stack\n(N × {Self-Attention + FFN})"]
    Encoder --> CrossAttn["Cross-Attention\n(decoder reads encoder output)"]
    Output["Output Tokens\n(target language, shifted right)"] --> Decoder["Decoder Stack\n(N × {Masked Self-Attention + Cross-Attention + FFN})"]
    Decoder --> CrossAttn
    CrossAttn --> Linear["Linear + Softmax"]
    Linear --> Prediction["Next Token"]
```

  • Encoder-only (BERT): Reads the full sequence bidirectionally. Best for classification, NER, embeddings.
  • Decoder-only (GPT): Autoregressive; each token attends only to past tokens (causal mask). Best for generation.
  • Encoder-Decoder (T5, BART): Input → encoder; generate → decoder. Best for translation, summarization.
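The causal mask that separates decoder-only models from encoder-only ones is just additive -inf above the diagonal, applied before the softmax. A small sketch:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Future positions (upper triangle) get -inf before softmax,
    # so each token can only attend to itself and earlier tokens.
    future = np.triu(np.ones((n, n)), k=1)
    return np.where(future == 1, -np.inf, 0.0)

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    scores = scores + causal_mask(scores.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)  # exp(-inf) = 0, masking out the future
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))          # uniform raw scores, for illustration
weights = masked_softmax(scores)
print(np.round(weights, 2))
# Row i spreads its attention uniformly over positions 0..i; later positions get 0.
```

Encoder-only models like BERT simply skip the mask, which is exactly what makes them bidirectional.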


โš–๏ธ Transformer Scaling Laws and Limitations

Scaling Laws (Chinchilla, 2022)

Training loss decreases predictably with model size and data. Chinchilla's parametric fit is:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Where $N$ = parameters, $D$ = training tokens, and $E$ is the irreducible loss. Chinchilla showed that many "large" models were undertrained: for compute-optimal training, model size and data should be scaled in roughly equal proportion (about 20 tokens per parameter).

Quadratic Attention Complexity

Self-attention computes dot products between all pairs of tokens:

$$\text{Complexity} = O(n^2 \cdot d)$$

For a 4096-token context, that's ~16M dot products per layer. This is why long-context models (100K+ tokens) require approximations:

  • Sparse attention (Longformer, BigBird): attend only to local windows + global tokens.
  • Flash Attention: I/O-aware CUDA kernel that avoids materializing the full $n \times n$ attention matrix in HBM.
  • RoPE + ALiBi: Positional encodings that generalize better to unseen context lengths.
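The memory side of the quadratic bottleneck is easy to check with back-of-the-envelope arithmetic (fp32, single head, no batch dimension, materializing the full matrix as vanilla attention does):

```python
def attention_matrix_gb(n_tokens: int, bytes_per_score: int = 4) -> float:
    """Memory for one n x n attention matrix, in GB (fp32 by default)."""
    return n_tokens ** 2 * bytes_per_score / 1e9

print(attention_matrix_gb(4_096))    # ~0.07 GB: fine on any modern GPU
print(attention_matrix_gb(100_000))  # ~40 GB: larger than most single GPUs
```

This is the matrix Flash Attention refuses to materialize, and the one sparse patterns shrink to a band plus a few global columns.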

Layer Normalization and Residual Connections

Each sub-layer (attention, FFN) is wrapped in:

output = LayerNorm(x + Sublayer(x))

The residual connection (the "x +" term) prevents gradients from vanishing in deep stacks (GPT-4 reportedly uses ~120 layers). LayerNorm stabilizes training without relying on batch statistics.
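The post-norm wrapping from the original paper can be sketched directly (per-token normalization, no learned scale/shift for brevity):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize each token's feature vector independently: no batch statistics.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_block(x, sublayer):
    # output = LayerNorm(x + Sublayer(x)): the residual "x +" carries
    # the identity path; the sublayer only has to learn a correction.
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(2).standard_normal((4, 8))
out = sublayer_block(x, lambda t: 0.1 * t)  # placeholder sublayer
print(out.shape)  # (4, 8); each row now has mean ~0 and variance ~1
```

Most modern LLMs actually move the norm inside the residual (pre-norm, `x + Sublayer(LayerNorm(x))`), which trains more stably at depth; the post-norm form above is the original paper's.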


๐Ÿ›ก๏ธ Inside the FFN: Position-wise Feed-Forward Network

After attention, each token passes independently through a 2-layer MLP:

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

The intermediate dimension is typically 4× the model dimension (e.g., 3072 hidden for 768-dim BERT). This FFN is where much of the model's factual knowledge is believed to be stored: key-value memories for facts like "Paris is the capital of France."
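A minimal numpy sketch of the FFN with BERT-base sizes, using the tanh approximation of GELU (weights are random stand-ins for the learned parameters):

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU, as used in BERT/GPT implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, nonlinearity, project back: applied to each token independently.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff, n = 768, 3072, 4          # BERT-base: d_ff = 4 * d_model
rng = np.random.default_rng(3)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
x = rng.standard_normal((n, d_model))
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (4, 768)
```

In the key-value memory reading, the rows of W1 act as pattern detectors (keys) and the columns of W2 as the stored content (values).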


📌 Summary

  • Self-Attention computes pairwise importance scores across all tokens using Q/K/V projections. Complexity is $O(n^2)$.
  • Multi-Head Attention runs $h$ parallel attention streams to capture different relationship types.
  • Positional Encoding injects token order information because attention itself is permutation-invariant.
  • Encoder-only (BERT) = bidirectional; decoder-only (GPT) = autoregressive causal; encoder-decoder (T5) = seq2seq.
  • Flash Attention + sparse patterns address the quadratic memory bottleneck at long contexts.
  • FFN layers act as key-value memories; residual connections + LayerNorm enable very deep stacks.

๐Ÿ“ Practice Quiz

  1. Why is self-attention divided by $\sqrt{d_k}$ before the softmax?

    • A) To normalize token embeddings to unit length.
    • B) To prevent dot products from growing so large that softmax gradients vanish (saturation).
    • C) To apply weight decay during training.
      Answer: B
  2. A model must process a 100,000-token book. What is the main bottleneck with standard self-attention?

    • A) Tokenization speed.
    • B) $O(n^2)$ memory: the 100K × 100K attention matrix requires ~40 GB of memory per layer.
    • C) The FFN cannot handle sequences longer than 4096 tokens.
      Answer: B
  3. What is the key difference between an encoder-only model (BERT) and a decoder-only model (GPT)?

    • A) BERT uses more parameters.
    • B) BERT attends bidirectionally (full context); GPT uses a causal mask (only past tokens) for autoregressive generation.
    • C) GPT uses positional encodings; BERT does not.
      Answer: B

Written by Abstract Algorithms (@abstractalgorithms)