The Anatomy of an LLM: How Transformers Actually Work

The Anatomy of an LLM – Deep Dive (Simple Explanations)
[Figure: Inside an LLM. Self‑attention, Transformer blocks, and token flow: Q, K, V → multi‑head attention → token out]
📖 Plain‑English summary first: A Large Language Model (LLM) is like a very advanced autocomplete. It reads a sequence of words (your prompt) and predicts the next word, over and over. To do that well, it needs to understand which words relate to which in a sentence – even if they’re far apart. The Transformer architecture uses a trick called “self‑attention” to let every word look at every other word at the same time. This post explains exactly how that works, from the ground up.

🎯 The Big Picture: What is an LLM trying to do?

Imagine you see the sentence: “The cat sat on the mat because it was tired.” What does “it” refer to? The cat? The mat? A human would know – the cat is tired. An LLM must figure this out mathematically. It doesn’t “understand” like a person, but it learns patterns from billions of sentences. The Transformer is the architecture that made this possible at massive scale.

Before Transformers (pre‑2017), models like RNNs processed words one by one, keeping a “memory” of previous words. That memory faded over long distances – they couldn’t easily connect the first word of a paragraph to the last. Transformers solved this by processing all words in parallel and letting each word “attend” to every other word, no matter how far apart.

🧱 The Transformer Architecture – A Layered Cake

A modern GPT‑style LLM is a stack of many structurally identical “blocks” – typically 32 to 96 of them, each with its own learned weights. Each block contains two main parts: a self‑attention layer and a feed‑forward network. Each part is wrapped in a residual connection (a shortcut that helps training) and paired with layer normalization (which keeps numbers stable).

Analogy: Imagine a team of editors reviewing a document. Each editor (token) reads the entire document (all tokens) and highlights which other sentences are most relevant to the sentence they are editing. That’s attention. Then each editor rewrites their sentence based on those highlighted references – that’s the feed‑forward network. After many rounds (layers), the document becomes coherent.

Step 1: Tokenization and Embeddings

Before any math, the model must convert text into numbers. It doesn’t see letters; it sees tokens – small pieces of words (e.g., “play”, “ing”, “ cat”). Each token gets a unique ID. Then the model looks up a high‑dimensional vector (a list of hundreds of numbers) for that token – that’s the embedding. These embeddings are learned during training so that similar words (e.g., “happy” and “joyful”) have similar vectors.

Example: The sentence “AI is fun” might be tokenized as [“AI”, “ is”, “ fun”]. Each becomes a vector of, say, 768 numbers. But wait – the model also needs to know the order of tokens. That’s why we add positional encodings – a unique signal for each position (first, second, third). The final input to the first layer is token_embedding + position_encoding.
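
To make that last step concrete, here is a minimal sketch in Python with NumPy. The three‑entry vocabulary and the random embedding table are hypothetical stand‑ins for what a real model learns, and the sinusoidal position signal is an assumption borrowed from the original Transformer design; many current models use other position schemes.

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"AI": 0, " is": 1, " fun": 2}
d_model = 768  # vector size used in the example above

rng = np.random.default_rng(0)
# Stand-in for the learned embedding table (random here, trained in a real model).
embedding_table = rng.normal(0.0, 0.02, size=(len(vocab), d_model))

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return one fixed d_model-sized position signal per position (assumed scheme)."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)             # even dimensions
    encoding[:, 1::2] = np.cos(angles)             # odd dimensions
    return encoding

tokens = ["AI", " is", " fun"]
token_ids = [vocab[t] for t in tokens]             # text -> integer IDs
token_embedding = embedding_table[token_ids]       # IDs -> vectors, shape (3, 768)
position_encoding = sinusoidal_positions(len(tokens), d_model)

x = token_embedding + position_encoding            # input to the first layer
print(x.shape)  # (3, 768)
```

Swapping two tokens changes `x` even though the same embeddings are used, which is exactly the order information the model needs.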

Step 2: Self‑Attention – The Secret Sauce

Now the magic. For each token, the model creates three new vectors: Query (Q), Key (K), and Value (V). Think of Q as a “question” the token asks: “Which other tokens are relevant to me?”. K is a “label” on each token: “Here’s what I’m about”. V is the actual information that will be passed along if the token is chosen.

Simple analogy: In a library, each book (token) has a Query (what it’s looking for), a Key (what topics it covers), and a Value (the full text). When you search, you compare your Query to all Keys, and retrieve the Values of the best matches.

The attention score between token i and j is the dot product of Qᵢ and Kⱼ. A high score means token i thinks token j is relevant. Then we apply a softmax to turn scores into probabilities. Finally, the output for token i is the weighted sum of all Values Vⱼ, using those probabilities as weights.

Mathematically (but explained):

Attention(Q,K,V) = softmax( Q·K^T / √d ) · V

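The division by √d (the square root of the query/key vector dimension) keeps the dot products from growing with vector size; without it, the softmax would put almost all of its weight on a single token and gradients would become tiny. The whole formula is a handful of matrix multiplications computed for all tokens at once, which is what makes it so efficient on GPUs. In GPT‑style models, a mask is also applied before the softmax so that each token can only attend to itself and earlier tokens.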
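
The formula translates almost line for line into NumPy. This is a minimal sketch, not how production libraries implement it: the random Q, K, V matrices are placeholders for learned projections, and the optional `causal` flag is an addition beyond the formula, illustrating the masking that GPT‑style models use so a token cannot look at later tokens.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d)) V for a single sequence and a single head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_tokens, n_tokens)
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # hide later tokens
    weights = softmax(scores)                      # row i: how much token i attends to each j
    return weights @ V, weights                    # weighted sum of Values

rng = np.random.default_rng(0)
n_tokens, d = 3, 4
Q = rng.normal(size=(n_tokens, d))
K = rng.normal(size=(n_tokens, d))
V = rng.normal(size=(n_tokens, d))

out, weights = attention(Q, K, V)
causal_out, causal_weights = attention(Q, K, V, causal=True)
print(out.shape)             # (3, 4)
print(weights.sum(axis=-1))  # each row sums to 1
```

With `causal=True`, the first token can only attend to itself, so its output is simply its own Value vector.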

Multi‑head attention means we do this entire process several times (e.g., 8 or 32 times) with different learned projections of Q, K, V. Each “head” learns a different type of relationship. One head might focus on nearby words (grammar), another on distant coreferences (like “it” → “cat”), another on topic consistency. The outputs of all heads are concatenated and linearly transformed.
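
Here is a rough sketch of how the heads fit together, again in NumPy. The weight matrices `W_q`, `W_k`, `W_v`, `W_o` are random placeholders for learned projections, and the sizes (16‑dimensional vectors, 4 heads) are chosen only to keep the example small.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Run attention once per head on a slice of the projections, then merge."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(m):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(x @ W_q)
    K = split_heads(x @ W_k)
    V = split_heads(x @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores)                            # one attention pattern per head
    heads = weights @ V                                  # (n_heads, seq_len, d_head)

    # Concatenate the heads back into d_model-sized vectors, then mix them linearly.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(0.0, 0.1, size=(d_model, d_model)) for _ in range(4))

mha_out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(mha_out.shape)  # (5, 16)
```

Note that the heads split the vector between them (16 / 4 = 4 numbers each here), so adding heads does not by itself add parameters.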

Step 3: Feed‑Forward Networks

After attention, each token’s representation is passed through a two‑layer neural network. This is the same for every token (they share weights) but applied independently. It adds non‑linearity and allows the model to “think” about the attended information. Typically, the inner layer is 4 times larger than the embedding dimension (e.g., 768 → 3072 → 768). In a standard block this is where most of the parameters live: roughly two‑thirds, versus about one‑third in attention.
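
A small sketch of that 768 → 3072 → 768 network, with a parameter count to back up the claim about where the weights sit. The ReLU activation is an assumption (the original Transformer used it; many newer models use GELU or gated variants), and the attention count assumes four square projection matrices with biases.

```python
import numpy as np

d_model, d_inner = 768, 3072  # inner layer is 4x the embedding dimension

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.02, size=(d_model, d_inner))
b1 = np.zeros(d_inner)
W2 = rng.normal(0.0, 0.02, size=(d_inner, d_model))
b2 = np.zeros(d_model)

def feed_forward(x: np.ndarray) -> np.ndarray:
    """Expand, apply a non-linearity (ReLU assumed), and project back."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # (n_tokens, 3072)
    return hidden @ W2 + b2                 # (n_tokens, 768)

x = rng.normal(size=(3, d_model))           # three token vectors
ffn_out = feed_forward(x)                   # same weights applied to every token

ffn_params = W1.size + b1.size + W2.size + b2.size
attn_params = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections

print(ffn_out.shape)  # (3, 768)
print(ffn_params)     # 4722432
print(attn_params)    # 2362368
```

Under these assumptions the feed‑forward network holds about twice as many weights as the attention layer in the same block.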

Step 4: Residual Connections & Normalization

Deep networks are hard to train. Residual connections add the input of a sublayer to its output (so the layer learns only the difference). This gives gradients a direct path back through the stack, which helps keep them from vanishing. Layer normalization re‑centers and re‑scales the activations, making training faster and more stable.
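
Both ideas fit in a few lines. This sketch assumes the “pre‑norm” arrangement (normalize first, then apply the sublayer, then add the shortcut), which is one common ordering rather than the only one; the matrix `W` is a hypothetical stand‑in for attention or the feed‑forward network.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Re-center and re-scale each token vector to mean 0 and variance close to 1."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, sublayer, gamma, beta):
    """Pre-norm residual step: the sublayer only has to learn a correction to x."""
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(loc=3.0, scale=5.0, size=(4, d_model))   # badly scaled activations
gamma, beta = np.ones(d_model), np.zeros(d_model)       # learned in a real model

normed = layer_norm(x, gamma, beta)

W = rng.normal(0.0, 0.1, size=(d_model, d_model))       # stand-in sublayer weights
y = residual_sublayer(x, lambda h: h @ W, gamma, beta)

# A sublayer that outputs zeros leaves the input untouched:
# the shortcut carries the original signal through unchanged.
unchanged = residual_sublayer(x, lambda h: np.zeros_like(h), gamma, beta)
print(np.array_equal(unchanged, x))  # True
```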

🎓 Training an LLM: Three Giant Phases

Building an LLM like GPT‑4 reportedly costs tens of millions of dollars or more and uses thousands of GPUs. But the process is conceptually simple:

  1. Pre‑training (self‑supervised learning)
    The model is shown trillions of tokens from the internet, books, Wikipedia, etc. It learns to predict the next token given the previous ones. That’s all. By doing this, it implicitly learns grammar, facts, reasoning patterns, biases, and even some world models. This phase is the most expensive – it can take months on clusters of 10,000+ GPUs.
  2. Supervised Fine‑Tuning (SFT)
    Now we teach the model to follow instructions. We collect thousands of high‑quality question‑answer pairs written by humans. The model is further trained to produce those answers. It learns to be a helpful assistant rather than just a next‑token predictor.
  3. Reinforcement Learning from Human Feedback (RLHF)
    Humans rank multiple outputs from the model for the same prompt (e.g., “Which answer is more helpful and less harmful?”). A separate “reward model” is trained to predict these human preferences. Then the LLM is optimized using reinforcement learning (commonly the PPO algorithm) to maximize the reward model’s score. This steers the model toward human preferences – reducing toxicity, improving helpfulness, and adding refusal capabilities for unsafe requests.

Simpler analogy: First, the model reads the entire internet (pre‑training). Then a teacher shows it how to answer questions (fine‑tuning). Then a group of people votes on which answers are best, and the model learns to produce more of those (RLHF).
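
The “predict the next token” objective from pre‑training boils down to a cross‑entropy loss. Here is a minimal sketch with made‑up numbers: a 5‑token vocabulary and hand‑built logits stand in for a real model’s output, and `next_token_loss` is an illustrative helper, not a library function.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, token_ids: list) -> float:
    """Average cross-entropy of predicting token t+1 from position t."""
    predictions = logits[:-1]             # the last position has no next token to predict
    targets = np.asarray(token_ids[1:])   # the tokens that actually came next
    shifted = predictions - predictions.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab_size = 5
token_ids = [0, 3, 1, 4]                  # a 4-token training sequence

# An untrained model that considers every token equally likely.
uniform_logits = np.zeros((len(token_ids), vocab_size))
loss_uniform = next_token_loss(uniform_logits, token_ids)

# A model that strongly favors the correct next token at every position.
confident_logits = np.zeros((len(token_ids), vocab_size))
for t in range(len(token_ids) - 1):
    confident_logits[t, token_ids[t + 1]] = 20.0
loss_confident = next_token_loss(confident_logits, token_ids)

print(loss_uniform)    # ln(5), about 1.609
print(loss_confident)  # close to 0
```

Pre‑training is the process of nudging the weights so this number goes down across trillions of tokens. Supervised fine‑tuning typically reuses the same loss on the curated question‑answer pairs.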

🔍 Why This Matters for Your AI Strategy (Practical takeaways)
  • Context window limits: Self‑attention compute is O(n²) in the number of tokens – doubling the context quadruples the attention cost. That’s one reason a 128k context (as in GPT‑4 Turbo) is much more expensive to serve than 8k. For very long documents, use RAG instead of stuffing everything into the prompt.
  • Inference costs: At short to moderate context lengths, the feed‑forward layers account for most of the compute per token. If you need low‑cost, high‑volume generation, consider smaller models or distillation.
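  • Fine‑tuning efficiency: A 70B model needs ~280 GB just to hold its weights in 32‑bit precision, and full fine‑tuning needs several times that once gradients and optimizer states are added. With LoRA (Low‑Rank Adaptation), you freeze the original weights and update only small low‑rank matrices, which cuts training memory sharply; combined with quantization (QLoRA), smaller models can be fine‑tuned on a single consumer GPU. This is a game‑changer for domain adaptation.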
  • Why LLMs hallucinate: They are next‑token predictors, not fact databases. They have no internal “truth” check. That’s why grounding with retrieval (RAG) is essential for factual applications.
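
Two of these takeaways are easy to check with back‑of‑the‑envelope arithmetic. The sketch below counts attention score entries (one per pair of tokens) and compares a full weight matrix with its LoRA replacement. The 4096 × 4096 layer size and rank 16 are hypothetical values chosen for illustration, and both helper functions are made up for this example.

```python
def attention_score_entries(n_tokens: int) -> int:
    """One score per (query token, key token) pair, per head and per layer."""
    return n_tokens * n_tokens

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two thin matrices, (d_in x rank) and (rank x d_out), instead of d_in x d_out."""
    return rank * (d_in + d_out)

# Doubling the context quadruples the attention work.
doubling_ratio = attention_score_entries(16_000) / attention_score_entries(8_000)

# Going from 8k to 128k is 16x the tokens but 256x the attention scores.
long_context_ratio = attention_score_entries(128_000) / attention_score_entries(8_000)

# Hypothetical 4096 x 4096 projection matrix adapted with rank-16 LoRA.
full_params = 4096 * 4096
lora_params = lora_trainable_params(4096, 4096, rank=16)

print(doubling_ratio)              # 4.0
print(long_context_ratio)          # 256.0
print(full_params)                 # 16777216
print(lora_params)                 # 131072
print(full_params // lora_params)  # 128
```

In this example LoRA trains 128 times fewer weights for that layer, which is why the gradients and optimizer state fit in far less memory.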

📊 A Quick Tour of Model Sizes

Parameters are the weights in the neural network. More parameters generally mean more capability (but also more cost).

  • Small (1B–7B) – Run on a laptop or phone. Examples: Phi‑3 mini, Gemma 2 2B, Mistral 7B. Good for classification, extraction, simple Q&A.
  • Medium (13B–34B) – One or two consumer GPUs (usually with quantization). Examples: Llama 2 13B, Code Llama 34B. Decent reasoning, good for many business tasks.
  • Large (70B–200B) – Multiple high‑end GPUs (A100/H100). Examples: Llama 3 70B, GPT‑3.5 (size undisclosed; its predecessor GPT‑3 has 175B). Strong reasoning, code generation.
  • Giant (200B+) – Mostly accessed through an API. Examples: GPT‑4, Claude 3 Opus (parameter counts are not public, but both are generally placed in this tier). State‑of‑the‑art on complex multi‑step tasks.
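
The hardware notes in this list follow from simple arithmetic: parameters times bytes per parameter. Here is a rough sketch, assuming common storage precisions; it counts weights only, so real usage is higher once activations and the attention cache are included. `weight_memory_gb` is an illustrative helper, not a library function.

```python
# Bytes needed to store one parameter at common precisions (assumed values).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int4": 0.5}

def weight_memory_gb(n_params: int, precision: str) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

small_fp16 = weight_memory_gb(7_000_000_000, "fp16")    # 7B model, 16-bit
small_int4 = weight_memory_gb(7_000_000_000, "int4")    # 7B model, 4-bit quantized
large_fp16 = weight_memory_gb(70_000_000_000, "fp16")   # 70B model, 16-bit
large_int4 = weight_memory_gb(70_000_000_000, "int4")   # 70B model, 4-bit quantized

print(small_fp16)  # 14.0
print(small_int4)  # 3.5
print(large_fp16)  # 140.0
print(large_int4)  # 35.0
```

A 7B model quantized to 4 bits fits comfortably in laptop memory, while a 70B model at 16 bits needs more than one 80 GB data‑center GPU for the weights alone.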

Next: Open‑source vs. proprietary models – which one should you actually use?

Author: Jon-Paul Walton