How LLMs Work

Transformer Architecture Overview

The transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., is the architectural foundation of every modern large language model. Understanding it at an intuitive level — even without deep math — will make you a far more effective practitioner when working with LLMs.

Why Transformers Replaced RNNs

Before transformers, the dominant architecture for processing sequential text was the recurrent neural network (RNN) and its variants (LSTM, GRU). RNNs processed text one token at a time, left to right, maintaining a hidden state that was passed from step to step.

This had two critical problems:

  1. Sequential bottleneck: Each step depended on the previous step, so computation within a sequence couldn't be parallelized — processing a 1,000-token sequence required 1,000 sequential steps.
  2. Vanishing gradients: Information from early tokens in a long sequence was difficult to preserve through hundreds of recurrent steps.

Transformers solved both by processing all tokens in parallel using a mechanism called self-attention.

The High-Level Architecture

A decoder-only transformer (like GPT-4 or Claude) consists of:

Input tokens
     ↓
Token Embedding + Positional Encoding
     ↓
┌─────────────────────────────────┐
│  Transformer Block × N layers   │
│  ┌───────────────────────────┐  │
│  │ Multi-Head Self-Attention │  │
│  └─────────────┬─────────────┘  │
│                │ + residual     │
│  ┌─────────────┴─────────────┐  │
│  │   Feed-Forward Network    │  │
│  └─────────────┬─────────────┘  │
│                │ + residual     │
└────────────────┼────────────────┘
                 ↓
        Layer Normalization
                 ↓
         Linear + Softmax
                 ↓
     Next Token Probabilities

Token Embeddings

Before processing, text is converted to tokens (subword units), then each token is mapped to a high-dimensional vector (embedding) — typically thousands of dimensions; GPT-3, for example, used 12,288-dimensional embeddings (GPT-4's internal dimensions have not been published). These vectors encode semantic meaning — words with similar meanings cluster together in this space.

Positional encodings are added to tell the model where each token appears in the sequence (since self-attention itself is position-agnostic).
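The two steps above can be sketched in a few lines of NumPy. This is a toy illustration, not any production implementation: the embedding table here is random (in a real model it is learned), the vocabulary and dimensions are made up, and the positional encoding is the sinusoidal scheme from the original paper (many modern models use learned or rotary position encodings instead).

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from 'Attention Is All You Need'.

    Assumes d_model is even: even columns get sin, odd columns get cos.
    """
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

vocab_size, d_model, seq_len = 1000, 64, 10    # toy sizes
rng = np.random.default_rng(0)

# In a real model this table is learned during training; random here.
embedding_table = rng.normal(size=(vocab_size, d_model)) * 0.02

token_ids = np.array([5, 42, 7, 999, 0, 1, 2, 3, 4, 5])  # hypothetical tokenizer output
x = embedding_table[token_ids] + sinusoidal_positions(seq_len, d_model)
print(x.shape)  # (10, 64) — one d_model-dim vector per token
```

Note that the positional encoding is simply added to the token embedding, so both kinds of information share the same vector.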

Self-Attention: The Key Innovation

Self-attention allows each token to "look at" every other token in the sequence to gather context. For the token "bank" in "river bank," attention allows the model to focus on "river" to understand the meaning is geographic, not financial.

Mathematically, attention computes three vectors for each token:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I offer?"

The attention output is computed as: softmax(Q × Kᵀ / √d_k) × V. The softmax turns the query–key dot products into attention weights, and the result is a weighted average of the value vectors.

The √d_k scaling prevents the dot products from growing too large in high dimensions, which would push the softmax into saturated regions where gradients are near zero.
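The formula above translates almost directly into NumPy. This minimal sketch adds one detail the formula omits: decoder-only models apply a causal mask so each token can only attend to itself and earlier tokens. Shapes and inputs are invented for the demo.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, causal: bool = True):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len)
    if causal:
        # Mask out future positions so token i only sees tokens 0..i.
        future = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(future, -1e9, scores)
    weights = softmax(scores)              # each row sums to 1
    return weights @ V, weights            # weighted average of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))  # 5 tokens, d_k = 8
out, weights = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

The `-1e9` fill makes masked positions contribute (essentially) zero weight after the softmax, which is how "can't look at the future" is implemented in practice.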

Multi-Head Attention

Rather than computing attention once, transformers compute it multiple times in parallel with different learned weight matrices (heads). Each head can learn to attend to different kinds of relationships:

  • Head 1 might track syntactic subject-verb agreement
  • Head 2 might track coreference (which "it" refers to)
  • Head 3 might track semantic similarity

The outputs of all heads are concatenated and projected back to the model dimension.
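The concatenate-and-project step can be sketched as follows. Again, all weights here are random stand-ins for learned matrices, and the loop over heads is written for clarity — real implementations batch all heads into a single tensor operation.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 10, 64, 8
d_head = d_model // n_heads                # each head works in a smaller subspace

x = rng.normal(size=(seq_len, d_model))
# Per-head projection matrices for Q, K, V (learned in a real model).
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) * 0.02 for _ in range(3))
Wo = rng.normal(size=(d_model, d_model)) * 0.02  # final output projection

heads = []
for h in range(n_heads):
    Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]        # (seq_len, d_head) each
    weights = softmax(Q @ K.T / np.sqrt(d_head))
    heads.append(weights @ V)                        # this head's output

# Concatenate all heads, then project back to the model dimension.
out = np.concatenate(heads, axis=-1) @ Wo            # (seq_len, d_model)
print(out.shape)  # (10, 64)
```

Because each head has its own Q/K/V projections, each can specialize in a different relationship, exactly as the bullet list above describes.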

Feed-Forward Networks and Residual Connections

After attention, each token's representation passes through a position-wise feed-forward network (FFN) — two linear transformations with a non-linear activation in between. This is where much of the model's "knowledge" is believed to be stored.

Residual connections (adding the input to the output of each sublayer) and layer normalization are critical for stable training of very deep networks (100+ layers in large models).
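A minimal sketch of the FFN sublayer with its residual connection and layer normalization, under two simplifying assumptions: ReLU stands in for the GELU-style activations most LLMs actually use, and the "pre-norm" ordering (normalize, then sublayer, then add) is shown because it is the common choice in modern deep models.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token's vector to zero mean, unit variance (no learned scale/shift here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand, apply non-linearity, project back down."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2   # ReLU in place of GELU

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 10, 64, 256              # d_ff is typically ~4x d_model
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

# Residual connection: the sublayer output is ADDED to its input,
# giving gradients a direct path through very deep stacks.
h = x + ffn(layer_norm(x), W1, b1, W2, b2)
print(h.shape)  # (10, 64)
```

The `x +` is the entire trick: even if the FFN learns nothing useful at first, information still flows through unchanged, which is what keeps 100-layer stacks trainable.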

Decoder-Only vs. Encoder-Decoder

Modern chat LLMs (GPT, Claude, Gemini, Llama) use decoder-only architecture — they only generate tokens left to right (autoregressive). The original transformer was encoder-decoder (for translation: encode source language, decode to target). BERT uses encoder-only architecture (for classification and embedding).

Understanding which architecture you're working with matters because it affects what the model is optimized for and how you should prompt it.
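The autoregressive, left-to-right loop of a decoder-only model can be shown with a toy stand-in for the transformer itself. Everything here is hypothetical — `toy_model` just returns pseudo-random logits — but the loop structure (run the model on the prefix, turn logits into probabilities, append the chosen token, repeat) is exactly how decoder-only generation works.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def toy_model(token_ids, vocab_size: int = 50, seed: int = 0) -> np.ndarray:
    """Stand-in for a transformer: returns next-token logits given the prefix."""
    rng = np.random.default_rng(seed + sum(token_ids))
    return rng.normal(size=vocab_size)

tokens = [3, 17]                           # the prompt, already tokenized
for _ in range(5):
    logits = toy_model(tokens)             # forward pass over the whole prefix
    probs = softmax(logits)                # the Linear + Softmax step in the diagram
    tokens.append(int(np.argmax(probs)))   # greedy decoding: take the most likely token
print(tokens)
```

Real systems usually sample from `probs` (with temperature, top-p, etc.) rather than always taking the argmax, but the one-token-at-a-time structure is the same.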

Transformer Architecture — Check Your Understanding

3 questions · passing score 70%

  1. What is the primary innovation of the Transformer architecture compared to RNNs?

  2. In a decoder-only LLM (like GPT), how does the model generate the next token?

  3. What does 'context window' refer to in LLMs?