Building an LLM: The Complete Guide

📖 The one-paragraph version

A Large Language Model is often described as “a big pile of numbers called weights.” That’s true — but it’s like describing a restaurant as “a pile of ingredients.” The weights are the output of an enormous, multi-year engineering effort involving data curation, architecture design, scaling laws, hyperparameter tuning, distributed systems, alignment, evaluation, inference optimization, and deployment infrastructure. This post walks through every single piece — in both plain English and real technical detail — so you understand the full picture.

🧊 Part 1 — The Bigger Picture: Why Weights Are Just the Tip

Open any “how LLMs work” article and you’ll see the same story: weights, weights, weights. A matrix here, a gradient there, and boom — intelligence. It’s a clean narrative. It’s also misleading.

Weights are the artifact of training, not the training itself. They’re the final checkpoint file you download — but they encode the cumulative decisions of hundreds of engineers over months or years. To treat weights as “the model” is to confuse the score with the symphony.

💡 The Iceberg Model

What you see when you download a model: weights (the 10%).
What you don’t see: data pipelines, architecture research, scaling laws, hyperparameter sweeps, distributed training infrastructure, alignment, evaluation, inference optimization, deployment, monitoring, and the tacit intuition of researchers who know when a loss spike is recoverable vs fatal (the 90%).

This post covers the whole iceberg. We’ll start from the very first decision (how do we chop text into tokens?) and end with the last (how do we serve this thing to a million users without going bankrupt?). Along the way, you’ll understand why two teams with identical weights can produce wildly different products — and why the “secret sauce” of frontier labs is almost never in the weights themselves.

🔤 Part 2 — Tokenization: The First Decision

The Simple Version

Before a model can learn from text, the text has to be chopped into pieces the model can handle. These pieces are called tokens. A token might be a whole word (“hello”), part of a word (“un-” and “believ” and “able”), a character, or even a byte. The tokenizer is the translator between human text and model math.

The Technical Version

A tokenizer defines a vocabulary V of |V| tokens (typically 32,000 to 200,000) and a deterministic function that maps any Unicode string to a sequence of token IDs in {0, 1, ..., |V|-1}. Modern LLMs almost universally use subword tokenizers:

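BPE (Byte-Pair Encoding): GPT family, Llama. Iteratively merges the most frequent adjacent pair of tokens.
Unigram (via SentencePiece): T5, ALBERT. Learns a probabilistic subword vocabulary and prunes it down to the target size. SentencePiece also implements BPE, which is what Llama 1 and 2 use.
WordPiece: BERT, older models. Similar to BPE, but picks the merge that most increases training-data likelihood rather than the most frequent pair.
Byte-level BPE: GPT-2/3/4, Llama 3. Handles any Unicode without an UNK token.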
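
To make the BPE merge loop concrete, here is a minimal sketch of training and applying merges on a toy word list. It starts from characters rather than bytes and ignores pre-tokenization and special tokens, so it illustrates the idea without reproducing any production tokenizer. The function names (train_bpe, merge_pair, encode) are illustrative, not from any library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn a list of merges from a list of words (toy, character-level start)."""
    corpus = Counter(tuple(w) for w in words)   # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # ties go to the first pair seen
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe(["low", "low", "lower", "lowest", "newer", "newest"], num_merges=3)
print(merges)
print(encode("lowest", merges))
```

Real tokenizers run the same loop for tens of thousands of merges over billions of words, which is why the training corpus shapes the vocabulary so strongly.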

Why This Matters More Than You Think

Tokenizer choice has downstream effects on everything:

Context efficiency: GPT-2’s tokenizer needed roughly 2× as many tokens for Chinese text as for equivalent English text. Modern multilingual tokenizers (such as Llama 3’s 128K vocab) narrow this gap considerably.
Embedding size: Bigger vocab = bigger embedding matrix = more parameters.
Code handling: Code has lots of symbols ({ } [ ] = ==). Tokenizers trained mostly on English prose tokenize code inefficiently. Specialized code tokenizers help.
Special tokens: Models need tokens for <|begin_of_text|>, <|tool_call|>, <|image|>, etc. These shape what the model can do.

🔢 Tokenizer comparison

“Hello, world!” → GPT-2: 4 tokens · Llama 3: 4 tokens · Claude: 5 tokens · Gemini: 5 tokens (approximate; exact counts depend on the tokenizer version and on whether special tokens are counted). Same sentence, different tokenizations, which means different internal representations, even before training starts.

🏗️ Part 3 — Architecture: The Transformer Family

“Transformer” isn’t a single design — it’s a family. The original 2017 paper described an encoder-decoder architecture for translation. Modern LLMs use only the decoder half, and even that has evolved dramatically.

The Simple Version

Think of the architecture as the blueprint of a factory. The weights are the machines inside. You can put the same machines in two different factory layouts and get very different products. The layout matters.

Key Architectural Decisions

Choice	Options	Who Uses It	Tradeoff
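Activation	ReLU, GELU, SwiGLU, ReGLU	SwiGLU: Llama, PaLM, Qwen. GELU: GPT.	SwiGLU usually edges out GELU in quality. It adds a third FFN matrix (50% more FFN params at the same hidden size), so the hidden size is typically cut to about 2/3 to keep the parameter count equal.
Normalization	LayerNorm, RMSNorm, DeepNorm	RMSNorm: almost all modern LLMs. DeepNorm: very deep models.	RMSNorm drops the mean subtraction, so it is faster with nearly identical quality.
Position encoding	Learned, Sinusoidal, RoPE, ALiBi, YaRN	RoPE: Llama, Qwen, Mistral. ALiBi: BLOOM, MPT.	RoPE is the de facto standard but needs scaling methods such as YaRN to reach contexts longer than those seen in training; ALiBi extrapolates by design but is less widely adopted.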
Attention pattern	Full, Sliding window, Grouped-query, Multi-query	GQA: Llama 3, Mistral. MQA: Falcon. Sliding: Mistral, Longformer.	GQA/MQA massively reduce KV-cache memory at inference.
Dense vs MoE	All params active vs only some per token	MoE: Mixtral, DBRX, Grok-2, Qwen-MoE, Llama 4.	MoE: 70B quality at 13B inference cost. Harder to train and serve.
Context length	4K, 32K, 128K, 1M, 10M	128K+: Claude, Gemini, GPT-4. 1M+: Gemini 2.5.	Long context requires architectural changes (RoPE scaling, ring attention), not just training longer.
Alternative archs	State-space (Mamba), RWKV, RetNet	Jamba (hybrid), Mamba-2, Falcon-3.	Linear-time inference, but still catching up on quality.

These choices compound. A model with SwiGLU + RMSNorm + RoPE + GQA + MoE (the 2025-2026 “standard recipe”) is fundamentally different from GPT-3’s GELU + LayerNorm + learned positions + dense design, even at the same parameter count.
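
As one small illustration of how these choices differ in practice, the sketch below compares LayerNorm with RMSNorm on a single activation vector. It uses plain Python lists instead of tensors, and the eps value and function names are assumptions chosen for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Classic LayerNorm: subtract the mean, divide by the std, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean subtraction and no bias, only a rescale by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * len(x), [0.0] * len(x)
print("LayerNorm:", [round(v, 3) for v in layer_norm(x, ones, zeros)])
print("RMSNorm:  ", [round(v, 3) for v in rms_norm(x, ones)])
```

RMSNorm needs one reduction over the vector instead of two and carries half the learned parameters, which is where the speed advantage comes from.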

🍲 Part 4 — Data: The Real Secret Sauce

If architecture is the factory blueprint and weights are the machines, data is the raw material. And in 2026, every serious lab will tell you the same thing: data quality matters more than data quantity.

The Simple Version

You can build the most beautiful kitchen in the world, but if you cook with rotten ingredients, you’ll serve rotten food. The same is true of LLMs. A 7B model trained on pristine, carefully curated data will often beat a 70B model trained on garbage.

The Technical Version

Pre-training data pipelines in 2026 typically involve:

Collection

Common Crawl (web), GitHub (code), arXiv + PubMed (academic), Wikipedia, books, StackExchange, multilingual sources, licensed content (Reddit, news), and increasingly synthetic data generated by stronger models.

Deduplication

Near-duplicate removal at document, paragraph, and n-gram levels. MinHash + LSH for fuzzy dedup at web scale. Deduplication is consistently reported to improve downstream quality and reduce memorization; the Llama 3 report, for example, describes deduplicating at the URL, document, and line level.
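
The MinHash part of fuzzy dedup can be sketched in a few lines. This is a toy version under stated assumptions: word 3-gram shingles, 128 salted BLAKE2b hashes, and no LSH banding step (which is what makes the comparison scale to billions of documents). All function names are illustrative.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping word n-grams for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=128):
    """For each salted hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little")
            for s in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a, b):
    return len(a & b) / len(a | b)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
sa, sb = shingles(doc_a), shingles(doc_b)
print("exact Jaccard:   ", round(jaccard(sa, sb), 3))
print("MinHash estimate:", round(estimated_jaccard(minhash_signature(sa),
                                                   minhash_signature(sb)), 3))
```

A pipeline would then drop one document from any pair whose estimated similarity exceeds a chosen threshold; that threshold is a tuning decision, not something this sketch prescribes.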

Quality Filtering

Heuristic filters (length, punctuation ratio, profanity, perplexity under a reference model) plus classifier-based filters (train a small model to distinguish “high quality” from “low quality” pages). Often iterated multiple times.

Domain Mixing

The mixing ratios are the lab’s most closely guarded secret. Typical mixes: 40-60% web, 10-20% code, 5-10% academic, 5-10% books, 5-10% multilingual, 5-10% synthetic. Tweaking these by a few percent can dramatically change model behavior.

Curriculum

The order in which data is presented matters. Some teams start with high-quality books/Wikipedia, then move to noisier web data. Others do the reverse. Some interleave by difficulty. This is active research.

The Synthetic Data Revolution

By 2025-2026, synthetic data has become a first-class citizen. Microsoft’s Phi models, NVIDIA’s Nemotron, and many others demonstrated that carefully generated synthetic textbooks, coding problems, and reasoning traces can match or beat real data for specific capabilities. The key insight: synthetic data lets you control the distribution exactly — no more hoping the web has enough good math problems.

⚠️ The contamination trap

If your synthetic data is generated by a model that was itself trained on your test benchmarks, your model can end up memorizing the answers secondhand. Labs now spend serious effort on contamination detection, making sure eval data never leaks into training.

📈 Part 5 — Scaling Laws: The Math of Getting Bigger

In 2020, OpenAI published a paper showing that model performance scales predictably with compute. In 2022, DeepMind’s Chinchilla paper flipped the industry on its head by showing that most models were undertrained — they had too many parameters for the amount of data they’d seen.

The Simple Version

There’s a formula for how good a model will be, given how big it is and how much data it’s seen. You can use this formula to plan a training run before you spend $100M. Ignoring scaling laws is how you waste a year of work.

The Technical Version

The Kaplan/Chinchilla scaling laws say that loss L scales as a power law with model size N, dataset size D, and compute C:

L(N) ∝ N^(-α)      L(D) ∝ D^(-β)      L(C) ∝ C^(-γ)

The Chinchilla-optimal ratio is roughly 20 tokens per parameter. So a 70B model should train on ~1.4T tokens, and a 400B model on ~8T tokens. Llama 3 405B trained on 15T tokens, well beyond that ratio, because Meta could afford the compute and because additional high-quality tokens keep lowering loss, just with diminishing returns per FLOP.

Model	Parameters	Tokens Trained On	Tokens/Param	Notes
GPT-3 (2020)	175B	300B	1.7×	Massively undertrained by modern standards.
Chinchilla (2022)	70B	1.4T	20×	The compute-optimal baseline.
Llama 2 70B (2023)	70B	2T	~29×	Slightly over-trained for quality gains.
Gemini Ultra (2024)	~1.5T (MoE)	~36T	~24×	Frontier-class compute. Not officially disclosed; figures are unofficial estimates.
Llama 3 405B (2024)	405B	15T	37×	Well beyond Chinchilla-optimal.
Grok-3 (2025)	~300B+	~10T+	~33×	Trained on Colossus (100K H100s). Parameter and token counts are unofficial estimates.
Frontier 2026	1T+ (MoE)	30T+	30-50×	The new baseline for frontier labs.

Modern labs also use small-scale proxy runs to predict large-scale performance — training a 1B model on 1/100th the data and using scaling laws to forecast what the 100B version will look like. This saves enormous money.
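
The arithmetic behind these planning numbers is simple enough to script. The sketch below applies the 20-tokens-per-parameter rule of thumb and the widely used approximation that training costs about 6 FLOPs per parameter per token. That 6·N·D estimate is an assumption brought in for illustration (it ignores attention FLOPs and hardware utilization); the model figures are taken from the table above.

```python
def chinchilla_tokens(params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token budget for a dense model."""
    return params * tokens_per_param

def training_flops(params, tokens):
    """Rough training cost: about 6 FLOPs per parameter per token (forward + backward)."""
    return 6 * params * tokens

models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 3 405B": (405e9, 15e12),
}
for name, (n, d) in models.items():
    print(f"{name}: {d / n:.1f} tokens/param, "
          f"~{training_flops(n, d):.2e} training FLOPs, "
          f"20x rule suggests {chinchilla_tokens(n) / 1e12:.1f}T tokens")
```

Dividing the FLOP estimate by a cluster's sustained throughput gives a first-order guess at wall-clock time, which is how a run gets budgeted before anyone reserves GPUs.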

⚖️ Part 6 — What Is a Weight?

Now that we’ve covered the context, let’s zoom into the weights themselves.

The Simple Version

Imagine a giant light board with billions of dimmer switches. Each switch has a value between roughly -1 and +1. When a word comes in, the signal flows through the board, and every switch it passes through nudges the signal a little — amplifying some meanings, suppressing others, rotating concepts into new directions. By the time the signal reaches the end, it has been shaped into a prediction. Each dimmer switch is a weight.

The Technical Version

A weight is a single scalar floating-point number (typically float32, bfloat16, or float8) stored in a multi-dimensional tensor. Weights are organized into matrices that perform linear transformations on activation vectors.

For a single linear layer with input dimension d_in and output dimension d_out, the weight matrix W has shape [d_out, d_in], optionally with a bias vector b of shape [d_out] (many modern LLMs, including the Llama family, drop the bias). The computation is:

y = W·x + b

In a modern 70B-parameter model, these matrices are enormous. A single attention projection matrix might be [8192, 8192] — that’s 67 million weights in one matrix, and there are several per layer, across 80 layers.

🗺️ Part 7 — Where Do the Weights Live?

The weights are organized into named modules. Here’s the anatomy of a single transformer block:

┌─────────────────────────────────────────────────────────┐
│ Transformer Block (one of L identical copies)           │
│                                                         │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ RMSNorm1: γ₁ (scale only, no bias)                  │ │
│ └─────────────────────────────────────────────────────┘ │
│                          ↓                              │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Multi-Head Self-Attention                           │ │
│ │  • W_q [d, d]: query projection                     │ │
│ │  • W_k [d, d]: key projection                       │ │
│ │  • W_v [d, d]: value projection                     │ │
│ │  • W_o [d, d]: output projection                    │ │
│ └─────────────────────────────────────────────────────┘ │
│                          ↓ + residual                   │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ RMSNorm2: γ₂                                        │ │
│ └─────────────────────────────────────────────────────┘ │
│                          ↓                              │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Feed-Forward Network (SwiGLU MLP)                   │ │
│ │  • W_gate [d_ff, d]: gating                         │ │
│ │  • W_up   [d_ff, d]: expand                         │ │
│ │  • W_down [d, d_ff]: contract                       │ │
│ └─────────────────────────────────────────────────────┘ │
│                          ↓ + residual                   │
└─────────────────────────────────────────────────────────┘

Each norm is applied before its sub-layer (the pre-norm layout used by Llama-style models). d_ff is the FFN hidden size: 4d in the classic design, roughly 2.7d in SwiGLU models such as Llama 2 7B.

📊 Part 8 — The Complete Weight Reference Table

Below is a reference of every major weight group in a modern decoder-only transformer.

Weight Name	Shape	# Params (7B: d = 4096, d_ff = 11,008, V = 32,000, 32 layers)	What It Does
`embed_tokens.weight`	[V, d]	131 M	Token embedding table. Looks up a dense vector for each token ID. Encodes semantic meaning of words/subwords at the input.
`layers[i].self_attn.q_proj.weight`	[d, d]	16.8 M × 32	Projects input into Query vectors. Determines “what is this token looking for?” in the attention mechanism.
`layers[i].self_attn.k_proj.weight`	[d, d]	16.8 M × 32	Projects input into Key vectors. Acts as the “label” other tokens match against.
`layers[i].self_attn.v_proj.weight`	[d, d]	16.8 M × 32	Projects input into Value vectors. The actual information that gets passed along when a token is attended to.
`layers[i].self_attn.o_proj.weight`	[d, d]	16.8 M × 32	Output projection. Mixes the concatenated outputs of all attention heads back into the model dimension.
`layers[i].mlp.gate_proj.weight`	[d_ff, d]	45.1 M × 32	SwiGLU gating projection. Decides which features to “turn on” for this token. Together with the other two MLP matrices, stores much of the model’s factual knowledge. (d_ff is the FFN hidden size: 11,008 here, about 2.7d rather than the classic 4d.)
`layers[i].mlp.up_proj.weight`	[d_ff, d]	45.1 M × 32	SwiGLU up projection. Expands the representation into a higher-dimensional space for non-linear combinations.
`layers[i].mlp.down_proj.weight`	[d, d_ff]	45.1 M × 32	Projects back down to model dimension. Combines the gated features into the residual stream.
`layers[i].input_layernorm.weight`	[d]	4 K × 32	RMSNorm scale. Stabilizes activations before attention. Keeps training from blowing up.
`layers[i].post_attention_layernorm.weight`	[d]	4 K × 32	RMSNorm scale before the MLP. Same stabilization job, second time per block.
`norm.weight`	[d]	4 K	Final RMSNorm applied to the residual stream before the LM head.
`lm_head.weight`	[V, d]	131 M	Projects final hidden state into vocabulary logits. The “decision layer” that picks the next token. Sometimes weight-tied with `embed_tokens`, mostly in smaller models.
`rotary_emb.inv_freq`	[d_head/2]	64 (not a parameter)	RoPE frequency table. Not trained: a fixed sinusoidal schedule, stored as a buffer, that encodes token position into Q and K.
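
To see how per-matrix shapes add up to a “7B” model, the sketch below totals the parameters of a Llama-2-7B-style configuration. The dimensions (d = 4096, FFN hidden size 11,008, vocabulary 32,000, 32 layers, untied embeddings, no grouped-query attention, no biases) are assumptions taken from the published Llama 2 7B configuration, and the function name is illustrative.

```python
def llama_style_param_count(d, d_ff, vocab, layers, tied_embeddings=False):
    """Total trainable parameters of a dense decoder-only transformer (no biases)."""
    attn = 4 * d * d              # q, k, v, o projections
    mlp = 3 * d * d_ff            # gate, up, down projections
    norms = 2 * d                 # two RMSNorm scale vectors per block
    per_layer = attn + mlp + norms
    embed = vocab * d             # token embedding table
    head = 0 if tied_embeddings else vocab * d   # LM head, unless shared with embed
    final_norm = d
    return embed + layers * per_layer + final_norm + head

total = llama_style_param_count(d=4096, d_ff=11008, vocab=32000, layers=32)
print(f"{total:,} parameters ({total / 1e9:.2f}B)")
```

Roughly two thirds of the total sits in the MLP matrices, which is why FFN width is such an important lever in both capacity and cost.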

🎯 Part 9 — How Are the Weights Set?

We don’t program the weights — we grow them.

The Simple Version

Imagine you’re blindfolded on a hilly landscape, trying to reach the lowest valley. You feel the slope under your feet and step downhill. Repeat a trillion times. That’s training — the landscape is the loss function, the slope is the gradient, and your feet are the optimizer.

The Technical Version

Initialization

Weights start as small random numbers (truncated normal with std ~0.02, or a Xavier/MSRA-style scheme). Bad initialization leads to vanishing or exploding gradients, and training fails.

Forward Pass

A batch of token sequences (e.g., 4M tokens, split across all the GPUs in the job) flows through the model, producing logits.

Loss Computation

Cross-entropy loss between predicted logits and actual next tokens: L = -Σ log P(x_t | x_<t ; θ)
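
Here is a minimal sketch of that loss for a handful of positions, written with plain lists so the arithmetic is visible. It averages over positions rather than summing, which is the usual convention in training code; the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_loss(logits_per_position, target_ids):
    """Mean negative log-likelihood of the true next token at each position."""
    nll = [-log_softmax(logits)[t]
           for logits, t in zip(logits_per_position, target_ids)]
    return sum(nll) / len(nll)

# Toy example: vocabulary of 4 tokens, 2 positions.
logits = [[2.0, 0.5, 0.1, -1.0],    # model favors token 0
          [0.0, 0.0, 0.0, 0.0]]     # model is maximally unsure
targets = [0, 3]
print(round(next_token_loss(logits, targets), 4))
```

A model that guesses uniformly over a vocabulary of size |V| scores ln |V| per token, which is a useful reference point for the first steps of a run.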

Backward Pass

Autograd computes ∂L/∂θ — the gradient with respect to every weight — via the chain rule.

Optimizer Step

AdamW updates each weight with adaptive learning rate, momentum, and decoupled weight decay: θ ← θ - η · ( m̂ / (√v̂ + ε) + λ·θ )
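
For a single scalar weight, one AdamW step can be sketched as below. It follows the common convention (the one PyTorch uses) in which the weight-decay term is scaled by the learning rate; the default hyperparameters are typical values chosen for illustration, not a recommendation.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar weight. `t` is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad * grad     # squared-gradient average (second moment)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay, both scaled by the learning rate.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Toy problem: minimize (theta - 3)^2 with weight decay switched off.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (theta - 3.0)
    theta, m, v = adamw_step(theta, grad, m, v, t, lr=0.05, weight_decay=0.0)
print(round(theta, 3))
```

In a real model the same update runs elementwise over every tensor, which is why the optimizer has to store two extra values (m and v) per weight.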

Repeat for Trillions of Tokens

A frontier model trains for ~15T tokens. At 4M-token batches, that’s ~3.75M optimizer steps. On 16,000 H100s, this takes 3-6 months and $50M-$200M.

🧪 Part 10 — The Training Recipe: Hyperparameters

Alongside data, this is where much of a lab’s “secret sauce” lives. The same architecture and data can produce wildly different models depending on hyperparameters.

Hyperparameter	Typical Value	What It Does	If You Get It Wrong
Learning rate (peak)	1e-4 to 6e-4	Step size for weight updates.	Too high: loss diverges. Too low: training crawls.
LR schedule	Warmup → cosine decay	Ramps LR up, then decays to ~10% of peak.	No warmup: early instability. No decay: never converges.
Batch size	2M – 16M tokens	Tokens per optimizer step.	Too small: noisy gradients. Too big: diminishing returns.
Weight decay	0.1	Decoupled weight decay (AdamW): shrinks every weight slightly toward zero each step, discouraging large weights.	Too high: underfitting. Too low: overfitting.
Gradient clipping	1.0 (norm)	Caps gradient magnitude to prevent blowups.	Missing: training collapses on rare bad batches.
Z-loss coefficient	1e-4 to 1e-5	Auxiliary loss that penalizes large logits (stabilizes softmax).	Missing: logit blowup in late training.
Precision	bf16 (training), fp8 (Hopper and Blackwell)	Numeric format for weights and activations.	fp16: overflow risk. fp32: 2× slower, 2× memory.
Optimizer	AdamW (standard), Muon, Sophia	Algorithm for weight updates.	SGD: too slow for transformers. Bad Adam config: divergence.

Finding the right recipe requires hundreds of small proxy runs at 1B scale before committing to a 70B+ run. Labs like Meta, Google, and Anthropic publish papers describing these recipe choices — but the exact values are often kept private.
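
Two rows of the table, the warmup-plus-cosine schedule and gradient-norm clipping, are easy to show in code. The sketch below uses illustrative defaults (peak LR 3e-4, 2,000 warmup steps, decay to 10% of peak) and hypothetical function names; real recipes tune each of these.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]

for step in (0, 1000, 2000, 50_000, 100_000):
    print(step, f"{lr_at(step):.2e}")
print(clip_by_global_norm([3.0, 4.0]))
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its size is capped.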

🌐 Part 11 — Distributed Systems: Training at Scale

A 70B model in bf16 is ~140 GB. That doesn’t fit on one GPU. A 405B model is ~810 GB. That doesn’t fit on 8 GPUs. Training these models is first and foremost a distributed systems problem.
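
Those figures cover the weights alone. Training also has to hold gradients and optimizer state, and a back-of-the-envelope estimate is sketched below. It assumes a standard mixed-precision Adam setup of about 16 bytes per parameter (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights plus the two Adam moments) and ignores activations, which add more on top; the function names are illustrative.

```python
def training_state_gb(params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Weights + gradients + optimizer state, in GB (activations not included)."""
    return params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9

def per_gpu_gb(params, num_gpus):
    """Model-state memory per GPU if everything is sharded evenly (ZeRO-3 / FSDP style)."""
    return training_state_gb(params) / num_gpus

for name, n in (("70B", 70e9), ("405B", 405e9)):
    print(f"{name}: weights {n * 2 / 1e9:.0f} GB, "
          f"training state {training_state_gb(n):.0f} GB, "
          f"{per_gpu_gb(n, 1024):.1f} GB per GPU when sharded over 1,024 GPUs")
```

This is the arithmetic that makes sharding strategies unavoidable: the training state is several times larger than the checkpoint you eventually download.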

The Simple Version

Imagine trying to read a 10,000-page encyclopedia with 1,000 friends. You can’t each read the whole thing — you have to split it up. But you also need to stay in sync, share notes, and make sure nobody’s pages are missing. That’s distributed training.

Parallelism Strategies

Strategy	What It Splits	When Used	Tradeoff
Data Parallel (DP)	Dataset across GPUs	Always. The baseline.	Each GPU has full model. All-reduce gradients per step.
FSDP / ZeRO	Optimizer states, gradients, params	Almost always for models > 7B.	Per-GPU memory for model state shrinks roughly in proportion to the number of GPUs. Adds communication overhead.
Tensor Parallel (TP)	Individual weight matrices	Within a single node (NVLink).	Fast interconnect required. Splits each matmul across GPUs.
Pipeline Parallel (PP)	Layers across GPUs	Across nodes when model is too big.	Introduces “bubbles” — GPUs idle part of the time.
Sequence Parallel (SP)	Sequence dimension	Very long contexts (128K+).	Reduces activation memory. Complex to implement.
Context Parallel (CP)	Attention across GPUs	Million-token contexts.	Ring attention patterns. How closed models such as Gemini reach 1M+ context is not public.
Expert Parallel (EP)	MoE experts across GPUs	MoE models only.	All-to-all communication per token. Tricky to balance.

Modern frontier training uses 3D or 4D parallelism, combining DP + TP + PP + (SP, CP, or EP) simultaneously. The Llama 3 405B run combined tensor, context, pipeline, and data (FSDP) parallelism across 16,000 H100s.

The Engineering Reality

Fault tolerance: In a 16,000-GPU cluster, a GPU fails every few hours. Training must continue via automatic checkpointing and node replacement.
Network topology: NVLink within a node (900 GB/s), InfiniBand/RoCE between nodes (400-800 Gbps). Network design matters as much as GPU count.
Straggler detection: One slow node drags down the whole cluster. Systems must detect and replace stragglers automatically.
Checkpointing: Saving 1+ TB of state every few hundred steps without stopping training. Async checkpointing to fast storage is critical.

⚠️ The cosmic ray problem

At scale, a single GPU may silently produce a wrong number due to a cosmic ray bit-flip. Without careful validation, this corrupts the entire run. Labs build elaborate detection systems to catch these “silent data corruptions.”

🧬 Part 12 — The Three Phases of Making an LLM Useful

Raw pre-training produces a model that can complete text but won’t chat with you. Three core phases (rows 1–3 below) turn it into an assistant; rows 4 and 5 are newer additions that build on them.

Phase	Data Used	What Changes	Result
1. Pre-training	Trillions of tokens: web, books, code, Wikipedia, papers, synthetic	All weights updated. Next-token prediction loss.	Model knows language, facts, reasoning patterns, biases — but is a raw completion engine.
2. Supervised Fine-Tuning (SFT)	10K–1M high-quality (instruction, response) pairs written by humans or synthetically	All weights (or LoRA adapters) updated. Loss only on response tokens.	Model learns to follow instructions and produce assistant-style outputs.
3. RLHF / DPO / GRPO	Human or AI preference rankings: “Response A is better than B”	Weights nudged to increase probability of preferred responses.	Model becomes helpful, harmless, honest. Refuses dangerous requests. Matches human taste.
4. Rejection Sampling / RLAIF	AI-generated responses filtered by stronger AI judge	Further fine-tuning on verified outputs	Scales alignment without proportional human labeling cost.
5. Test-time compute (2025+)	No new weights — inference-time search	Model spends more FLOPs on harder problems via chain-of-thought, tree search, or verifier-guided sampling	o1/o3-style reasoning. Same weights, smarter inference.
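
To make the preference-optimization row concrete, here is a sketch of the DPO loss for a single (chosen, rejected) pair. It takes summed log-probabilities of each response under the policy being trained and under a frozen reference model; the numbers in the example are made up, and beta = 0.1 is a commonly used default rather than a prescribed value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where margin measures how much more the policy
    prefers the chosen response than the reference model does."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical log-probs: the policy has shifted toward the chosen response.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))
# Policy identical to the reference: the loss is ln 2.
print(round(dpo_loss(-11.0, -11.0, -11.0, -11.0), 4))
```

Minimizing this loss raises the probability of preferred responses relative to rejected ones, while the reference terms keep the policy from drifting arbitrarily far from its starting point.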

🛡️ Part 13 — Alignment & Safety Infrastructure

Alignment isn’t a single step — it’s an infrastructure that spans the entire lifecycle.

The Layers of Safety

Data-level: Filtering harmful content from pre-training data. Refusals baked into SFT examples.
Training-level: RLHF with human reviewers; Constitutional AI (Anthropic) where the model critiques its own outputs against a set of principles; GRPO (DeepSeek) for reasoning-focused alignment.
Mechanistic-level: Interpretability research identifying features and circuits associated with behaviors such as refusal and deception. Anthropic, Google DeepMind, and OpenAI have all published work in this area.
System-level: Input/output guardrails (separate classifiers that filter prompts and responses), rate limiting, monitoring, incident response.
Red-teaming: Thousands of adversarial humans (and AI red-teamers) trying to break the model before release. Findings feed back into training.
Evals: Dangerous capability evaluations (bioweapons, cyber, persuasion) run before every release.

💡 The uncomfortable truth

We don’t really know how to make a model truly safe at superhuman capability levels. Current techniques work well for today’s models but may not scale. This is why “AI safety” is a massive research area — and why labs spend hundreds of millions on it.

🎨 Part 14 — Training Specific Characteristics & Specialties

Want a model that’s great at Python? Or speaks like a pirate? Or diagnoses rare diseases? Here’s how each is achieved.

14.1 — Specialty: Code Generation

The training mix is re-weighted to 30-50% code tokens (GitHub, StackOverflow, documentation). SFT on curated (instruction, code) pairs with unit tests. Models like DeepSeek-Coder and CodeLlama show that even a small amount of high-quality code SFT dramatically lifts pass@k scores.

14.2 — Specialty: Medical / Legal / Financial

Continued pre-training on domain-specific corpora (PubMed, case law, SEC filings) at low learning rate (~1e-5) to avoid catastrophic forgetting. SFT on expert-written Q&A pairs. RLHF with expert reviewers for high-stakes outputs.

14.3 — Personality & Tone

Personality is almost entirely a product of SFT data style and RLHF preferences. Fine-tuning moves the weights only slightly relative to pre-training, so the underlying capabilities stay largely intact while tone and style shift. This is why a single base model can produce both a terse coding assistant and a warm therapy bot with different fine-tunes.

14.4 — Safety & Refusals

Trained via red-teaming: adversarial humans try to elicit harmful outputs, and those conversations are added to RLHF with “refusal is preferred” labels. Interpretability studies suggest the resulting behavior is largely mediated by a refusal direction in activation space.

👁️ Part 15 — Adding Modalities: Vision, Audio, Video

A pure LLM only sees text. To make it multimodal, we add new encoder modules whose outputs are projected into the LLM’s embedding space.

Modality	New Weights Added	Training Approach	What It Enables
Vision	ViT or SigLIP encoder (~300M–2B params) + linear projector	Encoder frozen at first; projector trained on image-caption pairs; then end-to-end fine-tune	Image understanding, OCR, chart reading, visual Q&A
Audio / Speech	Whisper-style encoder or discrete audio tokenizer + projector	Train projector on (audio, transcript) pairs; then multi-task SFT	Transcription, speech understanding, audio reasoning
Video	Frame sampler + vision encoder + temporal encoder	Same as vision but with frame-order supervision	Action recognition, video summarization, temporal reasoning
Image Generation	Separate diffusion/flow-matching decoder; LLM emits special tokens	LLM trained to emit image-token sequences; decoder trained with denoising loss	Unified chat + image generation (GPT-4o, Gemini 2.5 Flash Image)
Tool Use	No new weights — just SFT data with special tokens	SFT on (prompt, tool_call_json, tool_result, answer) trajectories	Model can invoke search, calculators, APIs, sandboxes
Robotics	Action head: linear layer [d, action_dim] on top of LLM	SFT on (observation, instruction, action) tuples from robot demos	VLA models like RT-2, π₀ that turn language into motor commands

The key insight of modern multimodal architectures (Flamingo, LLaVA, GPT-4o, Gemini) is that the LLM backbone stays the same. We just add encoders that translate other modalities into “tokens” the LLM already understands.

📏 Part 16 — Evaluation: Measuring What Matters

You can’t improve what you can’t measure. Evaluation infrastructure is as important as training infrastructure — and often just as expensive.

The Benchmark Ecosystem

Benchmark	What It Tests	Strengths	Weaknesses
MMLU / MMLU-Pro	57 academic subjects (MMLU); harder 10-option variant (MMLU-Pro)	Broad knowledge coverage. Standard for years.	Saturated by frontier models. Contamination risk.
GPQA	Graduate-level science Q&A	Hard enough to differentiate frontier models.	Narrow domain. Experts only.
HumanEval / MBPP	Python coding problems	Executes code — ground truth correctness.	Too easy now. Contaminated.
LiveBench / LiveCodeBench	Fresh problems updated monthly	Resists contamination. Tests current ability.	Smaller dataset. Less stable over time.
MATH / AIME	Competition math	Tests deep reasoning. Clear correctness.	Narrow. Doesn’t test real-world math.
Chatbot Arena (LMSYS)	Human preference via blind A/B	Measures what users actually care about. Elo-rated.	Expensive. Subjective. Gaming risk.
ARC-AGI	Abstract pattern reasoning	Tests generalization, not memorization.	Very hard. Even frontier models score low.
SWE-bench	Real GitHub issue resolution	Tests end-to-end coding ability.	Expensive to run. Contamination risk.

The Contamination Problem

If your training data contains the test questions, your benchmark scores are meaningless. Labs now use:

Canary strings embedded in benchmarks to detect leakage
Contamination detectors that compare training data against benchmark text
Evergreen benchmarks (LiveBench, LiveCodeBench) that update regularly
Held-out private evals that never get released
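
The simplest of these detectors is an n-gram overlap check between a training document and a benchmark item, sketched below. The whitespace tokenization, the default n = 8 window, and the function names are illustrative assumptions; production systems normalize text more carefully and index the corpus rather than comparing pairs.

```python
def ngrams(text, n=8):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(train_doc, benchmark_item, n=8):
    """Fraction of the benchmark item's n-grams that also appear in the training doc."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0          # item shorter than n tokens: nothing to compare
    return len(bench & ngrams(train_doc, n)) / len(bench)

train_doc = "forum post: what is the capital of france and why is it paris anyway"
question = "what is the capital of france"
print(contamination_score(train_doc, question, n=4))
print(contamination_score("an unrelated page about gpu kernels", question, n=4))
```

A score near 1 flags the item as likely leaked, and the usual remedy is to drop either the training document or the benchmark item from reporting.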

🎯 The eval arms race

By 2026, every frontier lab has an internal eval suite of 100+ benchmarks. Public benchmarks are necessary but not sufficient — the real differentiation happens on private evals tailored to the lab’s product goals.

⚡ Part 17 — Inference: Making It Usable

A model is useless if it’s too slow or expensive to serve. Inference optimization is a massive subfield — and in many ways harder than training, because you have to serve millions of concurrent users at low latency.

The Inference Stack

Technique	What It Does	Speedup / Savings	Tradeoff
KV Cache	Caches keys/values from previous tokens so they aren’t recomputed	Essential. Makes autoregressive generation feasible.	Dominates memory at long contexts.
FlashAttention	IO-aware exact attention algorithm	2-4× faster attention, less memory	Complex implementation. Now standard.
Quantization	Stores weights in lower precision (INT8, INT4, FP8)	2-4× memory reduction, 1.5-3× speedup	Small quality loss. GPTQ, AWQ, GGUF are common.
Continuous Batching	Groups requests dynamically instead of static batches	Large throughput gains at similar latency (published figures range from a few × to 20×+ over static batching)	Complex scheduler. vLLM, TensorRT-LLM.
Speculative Decoding	Small “draft” model proposes tokens; big model verifies	2-3× speedup with no quality loss	Needs a good draft model. Implementation complexity.
PagedAttention	Virtual memory for KV cache — non-contiguous allocation	~2× more concurrent requests	Core of vLLM. Now industry standard.
Prefix Caching	Shares KV cache across requests with same system prompt	Massive savings for multi-tenant serving	Only helps when prefixes are shared.
Distillation	Train a small model to mimic a big one	10-100× smaller with 80-95% quality	Loses some capabilities. Licensing issues.

The best inference stacks in 2026 combine most of these. A single served request might benefit from FlashAttention kernels, a paged KV cache, continuous batching, INT4 weights, and speculative decoding all at once. The result: a much lower cost per token with little or no loss in quality.
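
To see why the KV cache dominates memory at long contexts, and why grouped-query attention helps, the sketch below computes cache size from first principles. The configuration (80 layers, 64 query heads, head dimension 128, bf16 values) is an illustrative 70B-class setup chosen for the example, and the function name is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Two tensors (K and V) per layer, each of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

GIB = 1024 ** 3
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)   # one KV head per query head
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)    # 8 shared KV heads
print(f"full multi-head: {mha / GIB:.1f} GiB per 8K-token sequence")
print(f"grouped-query:   {gqa / GIB:.1f} GiB per 8K-token sequence")
```

The cache grows linearly with both sequence length and batch size, so an 8× reduction per sequence translates directly into more concurrent users per GPU.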

🛠️ Part 18 — How to Actually Train Your Own (Practical Guide)

You probably aren’t pre-training a 70B model from scratch. But here’s what’s realistic in 2026:

Goal	Method	Hardware Needed	Cost / Time
Custom chatbot personality	LoRA fine-tune on Llama 3 8B	1× RTX 4090 (24 GB)	$0.50/hr · a few hours
Domain expert (medical, legal)	Continued pre-training + LoRA SFT	4–8× A100 80GB	$150–500 · 1–3 days
Small specialized model from scratch	Pre-train 1B–3B on curated corpus	32–64× A100 80GB	$10K–50K · 1–3 weeks
Frontier-class model	Full pre-training 70B+	8,000+ H100 cluster	$50M+ · 3–6 months
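
LoRA, which appears in the first two rows, is cheap because it freezes the original matrix W and trains only a low-rank update B·A. The sketch below shows the forward pass and the parameter savings using plain Python lists; the tiny dimensions, alpha = 16, and the function names are illustrative assumptions.

```python
import random

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16):
    """y = W x + (alpha / r) * B (A x). W stays frozen; only A and B are trained."""
    r = len(A)                                  # LoRA rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, r):
    """A is [r, d_in] and B is [d_out, r]."""
    return r * (d_in + d_out)

rng = random.Random(0)
d_in, d_out, r = 6, 4, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # small random init
B = [[0.0] * r for _ in range(d_out)]                               # zero init: no change at step 0
x = [1.0] * d_in
print(lora_forward(W, A, B, x))

full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, r=8)
print(f"rank-8 LoRA trains {lora:,} of {full:,} weights ({100 * lora / full:.2f}%)")
```

Because B starts at zero, the adapted model is identical to the base model before training begins, and the gradients and optimizer state only need to cover A and B.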

The Practical Stack in 2026

Framework: PyTorch + torch.distributed + FSDP2, or DeepSpeed ZeRO-3
Tokenizers: SentencePiece or tiktoken, vocab ~100K–200K
Training loop: Hugging Face TRL, Nanotron, litgpt, or torchtitan for larger runs
Data: Dolma, RedPajama, StarCoderData, DCLM for pre-training; OpenHermes, Magpie, UltraChat for SFT
Alignment: TRL’s DPOTrainer, Open-Instruct, or TRL’s GRPOTrainer
Evaluation: lm-evaluation-harness, MMLU, GPQA, LiveBench, Chatbot Arena
Serving: vLLM, SGLang, TensorRT-LLM

💰 Part 19 — Cost & Economics

Let’s talk money. Building and running LLMs is expensive, and the economics drive many of the technical decisions.

The Cost Stack

Cost Category	Frontier Lab (Annual)	Mid-size Lab	Startup / Indie
Pre-training compute	$100M – $1B+	$5M – $50M	$0 (use open weights)
GPU cluster (owned)	100,000+ H100/B200	1,000 – 10,000	1 – 8 consumer GPUs
GPU cluster (cloud)	$200M – $1B+/yr	$10M – $100M/yr	$1K – $50K/yr
Data licensing	$50M – $500M	$1M – $20M	$0 (open data)
Human labelers (SFT/RLHF)	$50M – $200M	$500K – $10M	$0 (synthetic)
Research team	1,000+ PhDs	50 – 500	1 – 10
Inference serving	$500M – $2B+/yr	$5M – $50M/yr	$100 – $50K/mo

The Economics Shape the Tech

Many technical decisions are driven by cost:

MoE exists because per-token compute, in training and especially in inference, is what dominates cost at scale
Quantization is driven by serving economics — every GB of memory saved is money
Distillation lets you serve a 7B model that behaves like a 70B
Speculative decoding cuts GPU hours, which cuts dollars
Open weights (Llama, Mistral, Qwen) exist because the training cost is so high that only a few players can afford it — but everyone can benefit from the outputs

🧠 Part 20 — Where the Weights “Store” Different Kinds of Knowledge

Recent work in mechanistic interpretability suggests a rough division of labor across the network (layer ranges below assume an 80-layer model, and the boundaries are fuzzy):

Early layers (1–10): Surface-level features — parts of speech, simple syntax, local context
Middle layers (10–50): The “knowledge layers.” Fact recall, entity disambiguation, induction heads. This is where most factual knowledge lives as sparse features.
Late layers (50–80): Task-specific computation, formatting, style. The “last mile” that converts internal representations into next-token distributions.
Attention heads: Some specialize in previous-token (local copying), some in induction (completing patterns), some in previous noun (coreference), some in doc-start (attention to context beginning).

💡 The superposition insight

Models represent far more concepts than they have neurons by encoding them as directions in activation space, not as individual neurons. A 7B model may encode millions of concepts in its 4096-dimensional activation space by using near-orthogonal directions: in high dimensions, far more directions than dimensions can coexist with only small overlap between any pair. This is why “finding the neuron for X” usually fails: X is a direction, not a neuron.
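
The geometric fact behind superposition is easy to check numerically. The sketch below samples random unit vectors and measures how much they overlap on average. It uses random directions as a stand-in (a trained model learns its directions, so this illustrates only the geometry), and it uses small dimensions so it runs quickly in pure Python. The function names are illustrative.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Sample a direction uniformly on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_cosine(n_vectors: int, dim: int, seed: int = 0) -> float:
    """Average |cosine similarity| over all pairs of random unit vectors."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            total += abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
            pairs += 1
    return total / pairs


# 120 directions each time: more directions than dimensions for dim=16 and dim=64.
for dim in (16, 64, 256):
    print(dim, round(mean_abs_cosine(120, dim), 3))
```

The average overlap shrinks roughly as 1/sqrt(dim), so quadrupling the dimension halves the interference. Extrapolating the same trend to 4096 dimensions gives a typical overlap of about 0.01, which is why so many directions can share one space with little crosstalk.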

🔮 Part 21 — The Frontier: What’s Next

As of mid-2026, the field is moving fast on several fronts:

Test-time compute scaling: Models like o3, Claude 4 (with extended thinking), and DeepSeek-R1 spend more inference FLOPs on harder problems by generating longer reasoning traces before answering. The weights are the same; only the inference path changes.
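Mixture of Experts: Now the default for frontier models. Llama 4, Grok-3, and others use MoE to get the quality of a 70B+ dense model at the per-token compute of 10–20B active parameters, although all experts must still be held in memory.
State-space models: Mamba-2 and hybrids such as Jamba offer inference cost that grows linearly with sequence length in their state-space layers. Still catching up on quality but improving fast.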
Continuous training: Models that never stop learning — updating weights on a rolling stream of new data without forgetting. The holy grail of lifelong AI.
Weight merging: Taking two fine-tunes of the same base model and averaging their weights to combine capabilities. Surprisingly effective.
Agentic capabilities: Tool use, planning, memory, multi-step reasoning. Models that can browse the web, write and execute code, and coordinate with other models.
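Long context: 1M+ token contexts are now standard. 10M+ is being researched. This relies on techniques such as ring attention, which shards the attention computation across devices, and YaRN, which rescales rotary position embeddings to extend the context window.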
Process reward models: Instead of just judging the final answer, judging each step of reasoning. Enables better math and coding.
Multimodal unification: Single models that handle text, images, audio, video, and actions in one conversation. GPT-4o, Gemini 2.5, and Claude 4 are all heading this way.
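
Of the items above, weight merging is the simplest to show in code. The sketch below does plain linear interpolation between two checkpoints, with each checkpoint represented as a dictionary of flattened parameter lists. That representation, the name `merge_weights`, and the toy values are assumptions for illustration; real checkpoints are tensors, and more elaborate merging schemes exist.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_weights(a: Weights, b: Weights, alpha: float = 0.5) -> Weights:
    """Linear interpolation: alpha * a + (1 - alpha) * b, parameter by parameter."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])
        ]
    return merged


# Two hypothetical fine-tunes of the same base model (toy values).
code_tune = {"layer0.weight": [1.0, 3.0], "layer0.bias": [0.0]}
chat_tune = {"layer0.weight": [3.0, 5.0], "layer0.bias": [2.0]}
print(merge_weights(code_tune, chat_tune))
```

This only makes sense when both fine-tunes start from the same base model, so that corresponding parameters mean the same thing. Averaging two independently trained networks generally does not work.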

🔄 Part 22 — The Full Lifecycle: From Idea to Production

Let’s zoom out and see how all these pieces fit together in a real project.

Phase	Activities	Key Decisions	Deliverables
1. Research	Literature review, architecture exploration, small-scale experiments	What problem are we solving? What architecture? What scale?	Research plan, architecture spec, proxy run results
2. Data	Collection, cleaning, deduplication, quality filtering, mixing	What data sources? What mix? How much synthetic?	Curated training datasets, data cards, contamination reports
3. Infrastructure	Cluster setup, networking, storage, monitoring, fault tolerance	How many GPUs? What interconnect? What storage?	Production training cluster, monitoring dashboards
4. Pre-training	Large-scale training run, checkpointing, intervention	Learning rate schedule, batch size, when to stop	Base model checkpoint, training logs, loss curves
5. Evaluation	Benchmark runs, contamination checks, human evals	Is the model good enough? What are its weaknesses?	Evaluation report, benchmark scores, failure analysis
6. Post-training	SFT, RLHF/DPO, red-teaming, safety evals	What alignment strategy? What safety thresholds?	Aligned model checkpoint, safety report
7. Optimization	Quantization, distillation, inference optimization	What latency/throughput targets? What quality tradeoffs?	Optimized model, serving config, benchmarks
8. Deployment	Serving infrastructure, API design, monitoring, scaling	How to serve millions of requests? What SLA?	Production API, monitoring, incident response
9. Operation	Monitoring, feedback collection, continuous improvement	When to retrain? What data to collect?	Feedback loops, usage analytics, improvement roadmap

A frontier model goes through all nine phases over 12-24 months. Each phase has its own team, its own challenges, and its own failures. The weights you download at the end are just the final artifact of this enormous effort.

✅ Summary

Building an LLM is not just about weights. It’s about:

Tokenization — how you chop text into pieces
Architecture — the blueprint of the factory
Data — the raw material (and the real secret sauce)
Scaling laws — the math of getting bigger
Weights — the machines inside the factory
Training recipe — the hyperparameters and optimizer
Distributed systems — how you train across 16,000 GPUs
Post-training — SFT, RLHF, alignment
Safety — red-teaming, guardrails, interpretability
Specialties — code, medical, personality, modalities
Evaluation — measuring what matters
Inference — making it fast and cheap to serve
Economics — the cost reality that shapes everything
Operations — monitoring, feedback, continuous improvement

The weights are the model — but they’re also the output of an enormous, multi-year engineering effort. Understand the weights, and you understand the intelligence. Understand everything else, and you understand how to build it.

