A Large Language Model is often described as “a big pile of numbers called weights.” That’s true, but it’s like describing a restaurant as “a pile of ingredients.” The weights are the output of an enormous, multi-year engineering effort involving data curation, architecture design, scaling laws, hyperparameter tuning, distributed systems, alignment, evaluation, inference optimization, and deployment infrastructure. This post walks through every piece, in both plain English and real technical detail, so you understand the full picture.
In This Post
- The Bigger Picture
- Tokenization
- Architecture
- Data: The Real Secret Sauce
- Scaling Laws
- What Is a Weight?
- Where Weights Live
- Complete Weight Reference
- How Weights Are Set
- The Training Recipe
- Distributed Systems
- The Three Phases
- Alignment & Safety
- Specialties
- Modalities
- Evaluation
- Inference
- Practical Guide
- Cost & Economics
- Where Knowledge Lives
- The Frontier
- The Full Lifecycle
Part 1 - The Bigger Picture: Why Weights Are Just the Tip
Open any “how LLMs work” article and you’ll see the same story: weights, weights, weights. A matrix here, a gradient there, and boom: intelligence. It’s a clean narrative. It’s also misleading.
Weights are the artifact of training, not the training itself. They’re the final checkpoint file you download, but they encode the cumulative decisions of hundreds of engineers over months or years. To treat weights as “the model” is to confuse the score with the symphony.
What you see when you download a model: weights (the 10%).
What you don’t see: data pipelines, architecture research, scaling laws, hyperparameter sweeps, distributed training infrastructure, alignment, evaluation, inference optimization, deployment, monitoring, and the tacit intuition of researchers who know when a loss spike is recoverable vs fatal (the 90%).
This post covers the whole iceberg. We’ll start from the very first decision (how do we chop text into tokens?) and end with the last (how do we serve this thing to a million users without going bankrupt?). Along the way, you’ll understand why two teams with identical weights can produce wildly different products, and why the “secret sauce” of frontier labs is almost never in the weights themselves.
Part 2 - Tokenization: The First Decision
The Simple Version
Before a model can learn from text, the text has to be chopped into pieces the model can handle. These pieces are called tokens. A token might be a whole word (“hello”), part of a word (“un-” and “believ” and “able”), a character, or even a byte. The tokenizer is the translator between human text and model math.
The Technical Version
A tokenizer defines a vocabulary V of |V| tokens (typically 32,000 to 200,000) and a deterministic function that maps any Unicode string to a sequence of token IDs in {0, 1, ..., |V|-1}. Modern LLMs almost universally use subword tokenizers:
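- BPE (Byte-Pair Encoding): GPT family, Llama. Iteratively merges the most frequent adjacent pair of tokens.
- Unigram: available in SentencePiece and used by T5. Learns a probabilistic subword vocabulary and prunes it down. (Llama 1 and 2 also use SentencePiece, but in its BPE mode.)
- WordPiece: BERT, older models. Similar to BPE, but picks the merge that most increases training-data likelihood rather than the most frequent pair.
- Byte-level BPE: GPT-2/3/4, Llama 3. Operates on bytes, so it handles any Unicode without an UNK token.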
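The BPE merge loop described above can be sketched in a few lines. This is a toy, character-level version for illustration only: the function names (`merge_pair`, `train_bpe`, `encode`) and the tiny corpus are hypothetical, and production tokenizers add byte-level fallbacks, pre-tokenization rules, and special tokens.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a list of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    vocab = Counter(tuple(w) for w in words)  # word as symbols -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=3)
print(merges)
print(encode("lowest", merges))
```

The vocabulary size is the main knob here: each extra merge adds one token to the vocabulary and shortens the sequences the model has to process.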
Why This Matters More Than You Think
Tokenizer choice has downstream effects on everything:
- Context efficiency: GPT-2’s tokenizer needed roughly 2× as many tokens to represent Chinese text as equivalent English. Modern multilingual tokenizers (such as Llama 3’s 128K vocab) narrow this gap.
- Embedding size: Bigger vocab = bigger embedding matrix = more parameters.
- Code handling: Code has lots of symbols (`{ } [ ] = ==`). Tokenizers trained mostly on English prose tokenize code inefficiently. Specialized code tokenizers help.
- Special tokens: Models need tokens for `<|begin_of_text|>`, `<|tool_call|>`, `<|image|>`, etc. These shape what the model can do.
“Hello, world!” → GPT-2: 4 tokens · Llama 3: 4 tokens · Claude: 5 tokens · Gemini: 5 tokens. Same sentence, different tokenizations, which means different internal representations even before training starts.
Part 3 - Architecture: The Transformer Family
“Transformer” isn’t a single design; it’s a family. The original 2017 paper described an encoder-decoder architecture for translation. Modern LLMs use only the decoder half, and even that has evolved dramatically.
The Simple Version
Think of the architecture as the blueprint of a factory. The weights are the machines inside. You can put the same machines in two different factory layouts and get very different products. The layout matters.
Key Architectural Decisions
| Choice | Options | Who Uses It | Tradeoff |
|---|---|---|---|
| Activation | ReLU, GELU, SwiGLU, ReGLU | SwiGLU: Llama, PaLM, Qwen. GELU: GPT. | SwiGLU usually edges out GELU in quality but adds a third FFN matrix, so the hidden size is typically cut to about 8d/3 to keep the parameter count level. |
| Normalization | LayerNorm, RMSNorm, DeepNorm, μP | RMSNorm: almost all modern LLMs. DeepNorm: very deep models. | RMSNorm drops the mean subtraction: faster, nearly identical quality. |
| Position encoding | Learned, Sinusoidal, RoPE, ALiBi, YaRN | RoPE: Llama, Qwen, Mistral. ALiBi: BLOOM, MPT. | RoPE encodes relative position by rotation and stretches to longer contexts with scaling tricks such as YaRN; ALiBi is simpler and extrapolates by design but is less flexible. |
| Attention pattern | Full, Sliding window, Grouped-query, Multi-query | GQA: Llama 3, Mistral. MQA: Falcon. Sliding: Mistral, Longformer. | GQA/MQA massively reduce KV-cache memory at inference. |
| Dense vs MoE | All params active vs only some per token | MoE: Mixtral, DBRX, Grok-2, Qwen-MoE, Llama 4. | MoE can reach roughly 70B-class dense quality at ~13B active-parameter inference cost (Mixtral 8x7B is the usual example). Harder to train and serve. |
| Context length | 4K, 32K, 128K, 1M, 10M | 128K+: Claude, Gemini, GPT-4. 1M+: Gemini 2.5. | Long context requires architectural changes (RoPE scaling, ring attention), not just training longer. |
| Alternative archs | State-space (Mamba), RWKV, RetNet | Jamba (hybrid), Mamba-2, Falcon-3. | Linear-time inference, but still catching up on quality. |
These choices compound. A model with SwiGLU + RMSNorm + RoPE + GQA + MoE (the 2025-2026 “standard recipe”) is fundamentally different from GPT-3’s GELU + LayerNorm + learned positions + dense design, even at the same parameter count.
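To make the GQA/MQA row concrete, here is a rough KV-cache size calculation. It is a sketch under stated assumptions: the function name `kv_cache_bytes` is hypothetical, the configuration (32 layers, 32 query heads, head dimension 128) is an assumed Llama-3-8B-like shape, and the cache is assumed to be stored in bf16 (2 bytes per element).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one tensor for keys, one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 2 ** 30
layers, head_dim, seq_len = 32, 128, 8192  # assumed, illustrative shape

mha = kv_cache_bytes(layers, n_kv_heads=32, head_dim=head_dim, seq_len=seq_len)
gqa = kv_cache_bytes(layers, n_kv_heads=8, head_dim=head_dim, seq_len=seq_len)
mqa = kv_cache_bytes(layers, n_kv_heads=1, head_dim=head_dim, seq_len=seq_len)

print(f"MHA (32 KV heads): {mha / GIB:.3f} GiB per sequence")
print(f"GQA (8 KV heads):  {gqa / GIB:.3f} GiB per sequence")
print(f"MQA (1 KV head):   {mqa / GIB:.3f} GiB per sequence")
```

The cache scales linearly with the number of KV heads, so going from 32 to 8 heads cuts per-sequence cache memory by 4× without touching the query heads.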
Part 4 - Data: The Real Secret Sauce
If architecture is the factory blueprint and weights are the machines, data is the raw material. And in 2026, every serious lab will tell you the same thing: data quality matters more than data quantity.
The Simple Version
You can build the most beautiful kitchen in the world, but if you cook with rotten ingredients, you’ll serve rotten food. The same is true of LLMs. A 7B model trained on pristine, carefully curated data will often beat a 70B model trained on garbage.
The Technical Version
Pre-training data pipelines in 2026 typically involve:
Collection
Common Crawl (web), GitHub (code), arXiv + PubMed (academic), Wikipedia, books, StackExchange, multilingual sources, licensed content (Reddit, news), and increasingly synthetic data generated by stronger models.
Deduplication
Near-duplicate removal at document, paragraph, and n-gram levels. MinHash + LSH for fuzzy dedup at web scale. This alone can reportedly improve downstream quality by 10-40%, depending on the corpus and the metric; Llama 3’s team called it one of their biggest wins.
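As a minimal sketch of the MinHash half of that pipeline (the LSH bucketing step is omitted), the snippet below estimates Jaccard similarity between documents from fixed-size signatures. The helper names, the 3-word shingle size, and the use of seeded SHA-1 as the hash family are all assumptions chosen for illustration, not any lab's actual setup.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping n-word shingles from lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "stock markets closed higher after the central bank held interest rates steady"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimate_jaccard(sig_a, sig_b))
print("unrelated:     ", estimate_jaccard(sig_a, sig_c))
```

In a real pipeline the signatures are split into bands and hashed into buckets (LSH), so only documents that share a bucket are ever compared directly.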
Quality Filtering
Heuristic filters (length, punctuation ratio, profanity, perplexity under a reference model) plus classifier-based filters (train a small model to distinguish “high quality” from “low quality” pages). Often iterated multiple times.
Domain Mixing
The mixing ratios are the lab’s most closely guarded secret. Typical mixes: 40-60% web, 10-20% code, 5-10% academic, 5-10% books, 5-10% multilingual, 5-10% synthetic. Tweaking these by a few percent can dramatically change model behavior.
Curriculum
The order in which data is presented matters. Some teams start with high-quality books/Wikipedia, then move to noisier web data. Others do the reverse. Some interleave by difficulty. This is active research.
The Synthetic Data Revolution
By 2025-2026, synthetic data has become a first-class citizen. Microsoft’s Phi models, NVIDIA’s Nemotron, and many others demonstrated that carefully generated synthetic textbooks, coding problems, and reasoning traces can match or beat real data for specific capabilities. The key insight: synthetic data lets you control the distribution exactly, with no more hoping the web has enough good math problems.
One caveat: if your synthetic data is generated by a model that was trained on your test benchmarks, your model may effectively memorize the answers. Labs now spend serious effort on contamination detection, making sure eval data never leaks into training.
Part 5 - Scaling Laws: The Math of Getting Bigger
In 2020, OpenAI published a paper showing that model performance scales predictably with compute. In 2022, DeepMind’s Chinchilla paper flipped the industry on its head by showing that most models were undertrained: they had too many parameters for the amount of data they’d seen.
The Simple Version
There’s a formula for how good a model will be, given how big it is and how much data it’s seen. You can use this formula to plan a training run before you spend $100M. Ignoring scaling laws is how you waste a year of work.
The Technical Version
The Kaplan/Chinchilla scaling laws say that loss L falls as a power law in model size N, dataset size D, and compute C, as long as the other two are not the bottleneck:
$$L(N) \propto N^{-\alpha}, \qquad L(D) \propto D^{-\beta}, \qquad L(C) \propto C^{-\gamma}$$
The Chinchilla-optimal ratio is roughly 20 tokens per parameter. So a 70B model should train on ~1.4T tokens, and a 400B model on ~8T tokens. Llama 3 405B trained on 15T tokens, well beyond Chinchilla-optimal, because Meta could afford it and because training a fixed-size model on more data keeps lowering loss, which buys quality without raising inference cost.
| Model | Parameters | Tokens Trained On | Tokens/Param | Notes |
|---|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 1.7× | Massively undertrained by modern standards. |
| Chinchilla (2022) | 70B | 1.4T | 20× | The compute-optimal baseline. |
| Llama 2 70B (2023) | 70B | 2T | ~29× | Slightly over-trained for quality gains. |
| Gemini Ultra (2024) | ~1.5T (MoE) | ~36T | ~24× | Frontier-class compute. Sizes not officially disclosed; treat as estimates. |
| Llama 3 405B (2024) | 405B | 15T | 37× | Well beyond Chinchilla-optimal. |
| Grok-3 (2025) | ~300B+ | ~10T+ | ~33× | Trained on Colossus (100K H100s). Sizes not officially disclosed; treat as estimates. |
| Frontier 2026 | 1T+ (MoE) | 30T+ | 30-50× | The new baseline for frontier labs. |
Modern labs also use small-scale proxy runs to predict large-scale performance: training a 1B model on 1/100th the data and using scaling laws to forecast what the 100B version will look like. This saves enormous amounts of money.
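The planning arithmetic behind these numbers is simple enough to sketch. The snippet below uses the ~20 tokens-per-parameter rule from the text plus the widely used approximation that training a dense transformer costs about 6·N·D FLOPs. The function names are hypothetical, and the sustained throughput of 4e14 FLOP/s per GPU is an assumption for illustration, not a measured figure.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget under the ~20 tokens/param rule of thumb."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Approximate training cost: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def training_days(total_flops, n_gpus, sustained_flops_per_gpu):
    """Wall-clock days, assuming perfect scaling and no downtime."""
    seconds = total_flops / (n_gpus * sustained_flops_per_gpu)
    return seconds / 86_400


print(f"70B optimal tokens: {chinchilla_tokens(70_000_000_000):.2e}")

flops = training_flops(405e9, 15e12)  # a 405B model on 15T tokens
days = training_days(flops, n_gpus=16_000, sustained_flops_per_gpu=4e14)
print(f"405B x 15T tokens: {flops:.3e} FLOPs, ~{days:.0f} days on 16,000 GPUs")
```

Real runs take longer than this idealized estimate because of restarts, evaluation pauses, and hardware failures.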
Part 6 - What Is a Weight?
Now that we’ve covered the context, let’s zoom into the weights themselves.
The Simple Version
Imagine a giant light board with billions of dimmer switches. Each switch has a value between roughly -1 and +1. When a word comes in, the signal flows through the board, and every switch it passes through nudges the signal a little: amplifying some meanings, suppressing others, rotating concepts into new directions. By the time the signal reaches the end, it has been shaped into a prediction. Each dimmer switch is a weight.
The Technical Version
A weight is a single scalar floating-point number (typically float32, bfloat16, or float8) stored in a multi-dimensional tensor. Weights are organized into matrices that perform linear transformations on activation vectors.
For a single linear layer with input dimension d_in and output dimension d_out, the weight matrix W has shape [d_out, d_in]. Some architectures add a bias vector b of shape [d_out]; many modern LLMs omit it. The computation is:
$$y = W x + b$$
In a modern 70B-parameter model, these matrices are enormous. A single attention projection matrix might be [8192, 8192]. That’s 67 million weights in one matrix, and there are several per layer, across 80 layers.
Part 7 - Where Do the Weights Live?
The weights are organized into named modules. Here’s the anatomy of a single transformer block:
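As a rough sketch of that anatomy, assuming a Llama-style pre-norm block: each block normalizes its input, runs attention, adds the result back to the residual stream, then does the same with the MLP. The code below shows only the data flow in plain Python. `attn` and `mlp` are stand-in callables rather than real implementations, and all names are hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale a vector by its root-mean-square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def add(a, b):
    """Element-wise sum of two vectors (the residual connection)."""
    return [u + v for u, v in zip(a, b)]


def transformer_block(seq, attn, mlp, input_norm_w, post_attn_norm_w):
    """One pre-norm block over a sequence of token vectors."""
    # 1) input_layernorm -> self-attention -> residual add.
    normed = [rms_norm(x, input_norm_w) for x in seq]
    attn_out = attn(normed)              # mixes information across positions
    seq = [add(x, a) for x, a in zip(seq, attn_out)]
    # 2) post_attention_layernorm -> MLP -> residual add.
    normed = [rms_norm(x, post_attn_norm_w) for x in seq]
    mlp_out = [mlp(x) for x in normed]   # applied to each position independently
    return [add(x, m) for x, m in zip(seq, mlp_out)]


# Stand-in sublayers that output zeros: the block then reduces to the identity,
# which shows that the residual stream carries the input through unchanged.
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.25, 0.0]]
ones = [1.0] * 4
out = transformer_block(tokens,
                        attn=lambda s: [[0.0] * len(v) for v in s],
                        mlp=lambda v: [0.0] * len(v),
                        input_norm_w=ones, post_attn_norm_w=ones)
print(out == tokens)
```

A full model stacks dozens of these blocks between the embedding table and the final norm plus LM head.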
Part 8 - The Complete Weight Reference Table
Below is a reference of every major weight group in a modern decoder-only transformer. Parameter counts assume a Llama-2-7B-style configuration: d = 4096, V = 32,000, d_ff = 11,008, d_head = 128, and 32 layers.
| Weight Name | Shape | # Params (7B) | What It Does |
|---|---|---|---|
| `embed_tokens.weight` | [V, d] | 131 M | Token embedding table. Looks up a dense vector for each token ID. Encodes semantic meaning of words/subwords at the input. |
| `layers[i].self_attn.q_proj.weight` | [d, d] | 16.8 M × 32 | Projects input into Query vectors. Determines “what is this token looking for?” in the attention mechanism. |
| `layers[i].self_attn.k_proj.weight` | [d, d] | 16.8 M × 32 | Projects input into Key vectors. Acts as the “label” other tokens match against. |
| `layers[i].self_attn.v_proj.weight` | [d, d] | 16.8 M × 32 | Projects input into Value vectors. The actual information that gets passed along when a token is attended to. |
| `layers[i].self_attn.o_proj.weight` | [d, d] | 16.8 M × 32 | Output projection. Mixes the concatenated outputs of all attention heads back into the model dimension. |
| `layers[i].mlp.gate_proj.weight` | [d_ff, d] | 45.1 M × 32 | SwiGLU gating projection. Decides which features to “turn on” for this token. Stores much of the model’s factual knowledge. |
| `layers[i].mlp.up_proj.weight` | [d_ff, d] | 45.1 M × 32 | SwiGLU up projection. Expands the representation into a higher-dimensional space for non-linear combinations. |
| `layers[i].mlp.down_proj.weight` | [d, d_ff] | 45.1 M × 32 | Projects back down to model dimension. Combines the gated features into the residual stream. |
| `layers[i].input_layernorm.weight` | [d] | 4 K × 32 | RMSNorm scale. Stabilizes activations before attention. Keeps training from blowing up. |
| `layers[i].post_attention_layernorm.weight` | [d] | 4 K × 32 | RMSNorm scale before the MLP. Same stabilization job, second time per block. |
| `norm.weight` | [d] | 4 K | Final RMSNorm applied to the residual stream before the LM head. |
| `lm_head.weight` | [V, d] | 131 M | Projects final hidden state into vocabulary logits. The “decision layer” that picks the next token. Often weight-tied with `embed_tokens`. |
| `rotary_emb.inv_freq` | [d_head/2] | 64 | RoPE frequency table. Not trained: a fixed schedule of rotation frequencies (stored as a buffer) that encodes token position into Q and K. |
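Adding up the trainable rows gives the headline parameter count. The sketch below assumes a Llama-2-7B-style configuration (d = 4096, d_ff = 11,008, 32 layers, V = 32,000, untied embeddings, no biases). The function name `count_params` is hypothetical, and the fixed RoPE table is excluded because it is not trained.

```python
def count_params(d=4096, d_ff=11008, n_layers=32, vocab=32000, tied=False):
    """Total trainable parameters of a decoder-only transformer (no biases)."""
    attn = 4 * d * d       # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * d * d_ff     # gate_proj, up_proj, down_proj
    norms = 2 * d          # input_layernorm, post_attention_layernorm
    per_layer = attn + mlp + norms
    embed = vocab * d      # embed_tokens
    head = 0 if tied else vocab * d  # lm_head, unless tied to embed_tokens
    final_norm = d
    return embed + n_layers * per_layer + final_norm + head


total = count_params()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
print(f"with d_ff = 4d: {count_params(d_ff=4 * 4096) / 1e9:.2f}B")
```

The second line shows why SwiGLU models shrink the MLP: with three matrices at a full 4d hidden size, the same skeleton would land near 8.9B rather than 7B.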
Part 9 - How Are the Weights Set?
We don’t program the weights; we grow them.
The Simple Version
Imagine you’re blindfolded on a hilly landscape, trying to reach the lowest valley. You feel the slope under your feet and step downhill. Repeat a trillion times. That’s training: the landscape is the loss function, the slope is the gradient, and your feet are the optimizer.
The Technical Version
Initialization
Weights start as small random numbers (truncated normal, std ~0.02, or a variance-scaled scheme such as Xavier). Bad initialization = dead gradients = training fails.
Forward Pass
A batch of token sequences (e.g., 4M tokens, sharded across the GPUs in the cluster) flows through the model, producing logits.
Loss Computation
Cross-entropy loss between predicted logits and actual next tokens: $L = -\sum_t \log P(x_t \mid x_{<t}; \theta)$
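A minimal sketch of that loss in plain Python, assuming the logits for each position are already computed. It returns the mean over positions rather than the sum, which is the more common convention in practice. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def next_token_loss(logits_per_position, targets):
    """Mean negative log-likelihood of the true next token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


# Uniform logits over a 4-token vocabulary: the loss equals log(4).
print(next_token_loss([[0.0, 0.0, 0.0, 0.0]], targets=[2]))
# A confident, correct prediction drives the loss toward zero.
print(next_token_loss([[10.0, 0.0, 0.0, 0.0]], targets=[0]))
```

A freshly initialized model sits near log(|V|), the loss of a uniform guess over the vocabulary, and training pushes it down from there.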
Backward Pass
Autograd computes $\partial L / \partial \theta$, the gradient with respect to every weight, via the chain rule.
Optimizer Step
AdamW updates each weight with an adaptive learning rate, momentum, and decoupled weight decay: $\theta \leftarrow \theta - \eta \, \hat{m} / (\sqrt{\hat{v}} + \epsilon) - \eta \lambda \theta$
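For a single scalar weight, one AdamW step can be sketched as below. This is an illustration under assumptions rather than any framework's exact implementation: the weight decay term is scaled by the learning rate (as common implementations do), the default hyperparameters are merely typical values, and the function name `adamw_step` is hypothetical.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar weight; t is the 1-based step count."""
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive gradient step plus decoupled weight decay.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=1)
print(theta)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly one learning rate regardless of the gradient's scale.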
Repeat for Trillions of Tokens
A frontier model trains for ~15T tokens. At 4M-token batches, that’s ~3.75M optimizer steps. On 16,000 H100s, this takes 3-6 months and $50M-$200M.
Part 10 - The Training Recipe: Hyperparameters
This is another place where labs’ “secret sauce” lives. The same architecture and data can produce wildly different models based on hyperparameters.
| Hyperparameter | Typical Value | What It Does | If You Get It Wrong |
|---|---|---|---|
| Learning rate (peak) | 1e-4 to 6e-4 | Step size for weight updates. | Too high: loss diverges. Too low: training crawls. |
| LR schedule | Warmup → cosine decay | Ramps LR up, then decays to ~10% of peak. | No warmup: early instability. No decay: never converges. |
| Batch size | 2M – 16M tokens | Tokens per optimizer step. | Too small: noisy gradients. Too big: diminishing returns. |
| Weight decay | 0.1 | Decoupled L2-style regularization that discourages large weights. | Too high: underfitting. Too low: overfitting. |
| Gradient clipping | 1.0 (norm) | Caps gradient magnitude to prevent blowups. | Missing: training collapses on rare bad batches. |
| Z-loss coefficient | 1e-4 to 1e-5 | Auxiliary loss that penalizes large logits (stabilizes softmax). | Missing: logit blowup in late training. |
| Precision | bf16 (training), fp8 (Hopper, Blackwell) | Numeric format for weights and activations. | fp16: overflow risk. fp32: 2× slower, 2× memory. |
| Optimizer | AdamW (standard), Muon, Sophia | Algorithm for weight updates. | SGD: too slow for transformers. Bad Adam config: divergence. |
Finding the right recipe requires hundreds of small proxy runs at 1B scale before committing to a 70B+ run. Labs like Meta, Google, and Anthropic publish papers describing these recipe choices, but the exact values are often kept private.
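Two rows of the table translate directly into code: the warmup-then-cosine learning-rate schedule and gradient clipping by global norm. The sketch below is illustrative only. The step counts, peak learning rate, and function names are assumptions, and real training loops apply clipping across all parameter tensors rather than a flat list.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000,
          min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients down so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


for step in (0, 1000, 1999, 50_000, 100_000):
    print(step, f"{lr_at(step):.2e}")

print(clip_by_global_norm([3.0, 4.0]))   # norm 5 is rescaled to about 1
print(clip_by_global_norm([0.3, 0.4]))   # norm 0.5 is left untouched
```

Clipping changes only the length of the gradient vector, never its direction, which is why it tames rare bad batches without biasing ordinary steps.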
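Part 11 - Distributed Systems: Training at Scale
A 70B model in bf16 is ~140 GB of weights alone. That doesn’t fit on one GPU. A 405B model is ~810 GB. That doesn’t fit on an 8-GPU node of 80 GB cards (640 GB), and training needs several times more memory for gradients and optimizer state. Training these models is first and foremost a distributed systems problem.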
The Simple Version
Imagine trying to read a 10,000-page encyclopedia with 1,000 friends. You can’t each read the whole thing; you have to split it up. But you also need to stay in sync, share notes, and make sure nobody’s pages are missing. That’s distributed training.
Parallelism Strategies
| Strategy | What It Splits | When Used | Tradeoff |
|---|---|---|---|
| Data Parallel (DP) | Dataset across GPUs | Always. The baseline. | Each GPU has full model. All-reduce gradients per step. |
| FSDP / ZeRO | Optimizer states, gradients, params | Almost always for models > 7B. | Reduces memory 4-8ร. Adds communication overhead. |
| Tensor Parallel (TP) | Individual weight matrices | Within a single node (NVLink). | Fast interconnect required. Splits each matmul across GPUs. |
| Pipeline Parallel (PP) | Layers across GPUs | Across nodes when model is too big. | Introduces “bubbles”: GPUs idle part of the time. |
| Sequence Parallel (SP) | Sequence dimension | Very long contexts (128K+). | Reduces activation memory. Complex to implement. |
| Context Parallel (CP) | Attention across GPUs | Million-token contexts. | Ring attention patterns. Used by Gemini for 1M+ context. |
| Expert Parallel (EP) | MoE experts across GPUs | MoE models only. | All-to-all communication per token. Tricky to balance. |
Modern frontier training uses 3D or 4D parallelism, combining DP + TP + PP + (SP or EP) simultaneously. The Llama 3 405B run used a complex hybrid across 16,000 H100s.
The Engineering Reality
- Fault tolerance: In a 16,000-GPU cluster, a GPU fails every few hours. Training must continue via automatic checkpointing and node replacement.
- Network topology: NVLink within a node (900 GB/s), InfiniBand/RoCE between nodes (400-800 Gbps). Network design matters as much as GPU count.
- Straggler detection: One slow node drags down the whole cluster. Systems must detect and replace stragglers automatically.
- Checkpointing: Saving 1+ TB of state every few hundred steps without stopping training. Async checkpointing to fast storage is critical.
At scale, a single GPU may silently produce a wrong number due to a cosmic ray bit-flip. Without careful validation, this corrupts the entire run. Labs build elaborate detection systems to catch these “silent data corruptions.”
Part 12 - The Three Phases of Making an LLM Useful
Raw pre-training produces a model that can complete text but won’t chat with you. Three core phases turn it into an assistant, and two more recent additions (rows 4 and 5 below) extend them.
| Phase | Data Used | What Changes | Result |
|---|---|---|---|
| 1. Pre-training | Trillions of tokens: web, books, code, Wikipedia, papers, synthetic | All weights updated. Next-token prediction loss. | Model knows language, facts, reasoning patterns, biases, but is a raw completion engine. |
| 2. Supervised Fine-Tuning (SFT) | 10K–1M high-quality (instruction, response) pairs written by humans or synthetically | All weights (or LoRA adapters) updated. Loss only on response tokens. | Model learns to follow instructions and produce assistant-style outputs. |
| 3. RLHF / DPO / GRPO | Human or AI preference rankings: “Response A is better than B” | Weights nudged to increase probability of preferred responses. | Model becomes helpful, harmless, honest. Refuses dangerous requests. Matches human taste. |
| 4. Rejection Sampling / RLAIF | AI-generated responses filtered by stronger AI judge | Further fine-tuning on verified outputs | Scales alignment without proportional human labeling cost. |
| 5. Test-time compute (2025+) | No new weights; inference-time search | Model spends more FLOPs on harder problems via chain-of-thought, tree search, or verifier-guided sampling | o1/o3-style reasoning. Same weights, smarter inference. |
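To show what “weights nudged to increase probability of preferred responses” means in practice, here is a sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The function name, argument names, and β = 0.1 are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy already prefers the chosen response: loss drops below log(2).
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Minimizing this loss raises the chosen response's likelihood relative to the rejected one, while the reference terms keep the policy from drifting arbitrarily far from where it started.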
Part 13 - Alignment & Safety Infrastructure
Alignment isn’t a single step; it’s an infrastructure that spans the entire lifecycle.
The Layers of Safety
- Data-level: Filtering harmful content from pre-training data. Refusals baked into SFT examples.
- Training-level: RLHF with human reviewers; Constitutional AI (Anthropic) where the model critiques its own outputs against a set of principles; GRPO (DeepSeek) for reasoning-focused alignment.
- Mechanistic-level: Interpretability research identifying “refusal neurons” and “deception circuits” in the model. Anthropic, Google DeepMind, and OpenAI have all published work in this area.
- System-level: Input/output guardrails (separate classifiers that filter prompts and responses), rate limiting, monitoring, incident response.
- Red-teaming: Thousands of adversarial humans (and AI red-teamers) trying to break the model before release. Findings feed back into training.
- Evals: Dangerous capability evaluations (bioweapons, cyber, persuasion) run before every release.
We don’t really know how to make a model truly safe at superhuman capability levels. Current techniques work well for today’s models but may not scale. This is why “AI safety” is a massive research area, and why labs spend hundreds of millions on it.
Part 14 - Training Specific Characteristics & Specialties
Want a model that’s great at Python? Or speaks like a pirate? Or diagnoses rare diseases? Here’s how each is achieved.
14.1 - Specialty: Code Generation
The training mix is re-weighted to 30-50% code tokens (GitHub, StackOverflow, documentation). SFT on curated (instruction, code) pairs with unit tests. Models like DeepSeek-Coder and CodeLlama show that even a small amount of high-quality code SFT dramatically lifts pass@k scores.
14.2 - Specialty: Medical / Legal / Financial
Continued pre-training on domain-specific corpora (PubMed, case law, SEC filings) at low learning rate (~1e-5) to avoid catastrophic forgetting. SFT on expert-written Q&A pairs. RLHF with expert reviewers for high-stakes outputs.
14.3 - Personality & Tone
Personality is almost entirely a product of SFT data style and RLHF preferences. The model’s underlying capabilities change little; fine-tuning mostly shifts how they are expressed. This is why a single base model can produce both a terse coding assistant and a warm therapy bot with different fine-tunes.
14.4 - Safety & Refusals
Trained via red-teaming: adversarial humans try to elicit harmful outputs, and those conversations are added to RLHF with “refusal is preferred” labels. The model learns a refusal direction in activation space.
Part 15 - Adding Modalities: Vision, Audio, Video
A pure LLM only sees text. To make it multimodal, we add new encoder modules whose outputs are projected into the LLM’s embedding space.
| Modality | New Weights Added | Training Approach | What It Enables |
|---|---|---|---|
| Vision | ViT or SigLIP encoder (~300M–2B params) + linear projector | Encoder frozen at first; projector trained on image-caption pairs; then end-to-end fine-tune | Image understanding, OCR, chart reading, visual Q&A |
| Audio / Speech | Whisper-style encoder or discrete audio tokenizer + projector | Train projector on (audio, transcript) pairs; then multi-task SFT | Transcription, speech understanding, audio reasoning |
| Video | Frame sampler + vision encoder + temporal encoder | Same as vision but with frame-order supervision | Action recognition, video summarization, temporal reasoning |
| Image Generation | Separate diffusion/flow-matching decoder; LLM emits special tokens | LLM trained to emit image-token sequences; decoder trained with denoising loss | Unified chat + image generation (GPT-4o, Gemini 2.5 Flash Image) |
| Tool Use | No new weights โ just SFT data with special tokens | SFT on (prompt, tool_call_json, tool_result, answer) trajectories | Model can invoke search, calculators, APIs, sandboxes |
| Robotics | Action head: linear layer [d, action_dim] on top of LLM | SFT on (observation, instruction, action) tuples from robot demos | VLA models like RT-2, π₀ that turn language into motor commands |
The key insight of modern multimodal architectures (Flamingo, LLaVA, GPT-4o, Gemini) is that the LLM backbone stays the same. We just add encoders that translate other modalities into “tokens” the LLM already understands.
Part 16 - Evaluation: Measuring What Matters
You can’t improve what you can’t measure. Evaluation infrastructure is as important as training infrastructure, and often just as expensive.
The Benchmark Ecosystem
| Benchmark | What It Tests | Strengths | Weaknesses |
|---|---|---|---|
| MMLU / MMLU-Pro | 50+ academic subjects | Broad knowledge coverage. Standard for years. | Saturated by frontier models. Contamination risk. |
| GPQA | Graduate-level science Q&A | Hard enough to differentiate frontier models. | Narrow domain. Experts only. |
| HumanEval / MBPP | Python coding problems | Executes code, so correctness is ground truth. | Too easy now. Contaminated. |
| LiveBench / LiveCodeBench | Fresh problems updated monthly | Resists contamination. Tests current ability. | Smaller dataset. Less stable over time. |
| MATH / AIME | Competition math | Tests deep reasoning. Clear correctness. | Narrow. Doesn’t test real-world math. |
| Chatbot Arena (LMSYS) | Human preference via blind A/B | Measures what users actually care about. Elo-rated. | Expensive. Subjective. Gaming risk. |
| ARC-AGI | Abstract pattern reasoning | Tests generalization, not memorization. | Very hard. Even frontier models score low. |
| SWE-bench | Real GitHub issue resolution | Tests end-to-end coding ability. | Expensive to run. Contamination risk. |
The Contamination Problem
If your training data contains the test questions, your benchmark scores are meaningless. Labs now use:
- Canary strings embedded in benchmarks to detect leakage
- Contamination detectors that compare training data against benchmark text
- Evergreen benchmarks (LiveBench, LiveCodeBench) that update regularly
- Held-out private evals that never get released
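A contamination detector of the second kind can be sketched as an n-gram overlap check: flag any benchmark item that shares a long enough word sequence with the training data. The helper names and the tiny example below are hypothetical, and the default window of 8 words is an assumption; production systems work on token IDs, normalize text more aggressively, and use hashed indexes to scale.

```python
def word_ngrams(text, n):
    """Set of n-word windows from lowercased, whitespace-split text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_contaminated(benchmark_items, training_docs, n=8):
    """Return benchmark items that share at least one n-gram with training data."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= word_ngrams(doc, n)
    return [item for item in benchmark_items
            if word_ngrams(item, n) & train_ngrams]


training_docs = [
    "quiz: what is the capital of france and why it matters",
    "a recipe for sourdough bread with a long cold ferment",
]
benchmark = [
    "what is the capital of france and why",
    "compute the derivative of x squared plus one",
]
flagged = find_contaminated(benchmark, training_docs, n=4)
print(f"{len(flagged)}/{len(benchmark)} items flagged:", flagged)
```

Shorter windows catch paraphrased leaks but raise false positives on common phrases, so the window length is a tuning decision.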
By 2026, every frontier lab has an internal eval suite of 100+ benchmarks. Public benchmarks are necessary but not sufficient; the real differentiation happens on private evals tailored to the lab’s product goals.
Part 17 - Inference: Making It Usable
A model is useless if it’s too slow or expensive to serve. Inference optimization is a massive subfield, and in many ways harder than training, because you have to serve millions of concurrent users at low latency.
The Inference Stack
| Technique | What It Does | Speedup / Savings | Tradeoff |
|---|---|---|---|
| KV Cache | Caches keys/values from previous tokens so they aren’t recomputed | Essential. Makes autoregressive generation feasible. | Dominates memory at long contexts. |
| FlashAttention | IO-aware exact attention algorithm | 2-4× faster attention, less memory | Complex implementation. Now standard. |
| Quantization | Stores weights in lower precision (INT8, INT4, FP8) | 2-4× memory reduction, 1.5-3× speedup | Small quality loss. GPTQ, AWQ, GGUF are common. |
| Continuous Batching | Groups requests dynamically instead of static batches | 10-100× throughput at same latency | Complex scheduler. vLLM, TensorRT-LLM. |
| Speculative Decoding | Small “draft” model proposes tokens; big model verifies | 2-3× speedup with no quality loss | Needs a good draft model. Implementation complexity. |
| PagedAttention | Virtual memory for KV cache, with non-contiguous allocation | ~2× more concurrent requests | Core of vLLM. Now industry standard. |
| Prefix Caching | Shares KV cache across requests with same system prompt | Massive savings for multi-tenant serving | Only helps when prefixes are shared. |
| Distillation | Train a small model to mimic a big one | 10-100× smaller with 80-95% quality | Loses some capabilities. Licensing issues. |
The best inference stacks in 2026 combine all of these. A single served request might go through FlashAttention → PagedAttention → continuous batching → INT4 quantization → speculative decoding. The result: frontier-model quality at consumer-hardware prices.
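Of the techniques in the table, quantization is the easiest to illustrate. The sketch below shows symmetric absmax INT8 quantization of a small weight vector: one scale per tensor, values rounded to integers in [-127, 127]. It is a deliberately simple scheme with hypothetical function names. Methods such as GPTQ and AWQ are considerably more sophisticated.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: floats -> (int8 values, scale)."""
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the quantized values."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0, 0.8131]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"scale={scale:.6f}, worst-case error={worst:.6f}")
```

Each weight now takes 1 byte instead of 2 (bf16) or 4 (fp32), and the rounding error per weight is bounded by half the scale.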
Part 18 - How to Actually Train Your Own (Practical Guide)
You probably aren’t pre-training a 70B model from scratch. But here’s what’s realistic in 2026:
| Goal | Method | Hardware Needed | Cost / Time |
|---|---|---|---|
| Custom chatbot personality | LoRA fine-tune on Llama 3 8B | 1× RTX 4090 (24 GB) | $0.50/hr · a few hours |
| Domain expert (medical, legal) | Continued pre-training + LoRA SFT | 4–8× A100 80GB | $150–500 · 1–3 days |
| Small specialized model from scratch | Pre-train 1B–3B on curated corpus | 32–64× A100 80GB | $10K–50K · 1–3 weeks |
| Frontier-class model | Full pre-training 70B+ | 8,000+ H100 cluster | $50M+ · 3–6 months |
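The first row is cheap because LoRA trains only small low-rank adapters: for a frozen matrix of shape [d_out, d_in], it adds two matrices with r·(d_in + d_out) parameters in total. The sketch below counts them for an assumed setup: rank 16 adapters on `q_proj` and `v_proj` only, with Llama-3-8B-like shapes (d = 4096, 32 layers, and 8 KV heads of dimension 128, so `v_proj` is [1024, 4096]). The function name and the choice of target modules are illustrative assumptions.

```python
def lora_param_count(shapes, rank):
    """Trainable LoRA parameters for a list of (d_out, d_in) weight shapes."""
    # Each adapted matrix gets A: [rank, d_in] and B: [d_out, rank].
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)


n_layers = 32
per_layer_targets = [
    (4096, 4096),  # q_proj
    (1024, 4096),  # v_proj (grouped-query attention: 8 KV heads x 128)
]
trainable = n_layers * lora_param_count(per_layer_targets, rank=16)
base = 8_000_000_000  # rough size of the frozen base model

print(f"trainable LoRA params: {trainable:,}")
print(f"fraction of base model: {trainable / base:.4%}")
```

Training well under 0.1% of the parameters means gradients and optimizer state are tiny, which is what makes a single consumer GPU plausible for this kind of fine-tune.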
The Practical Stack in 2026
- Framework: PyTorch + `torch.distributed` + FSDP2, or DeepSpeed ZeRO-3
- Tokenizers: SentencePiece or tiktoken, vocab ~100K–200K
- Training loop: Hugging Face TRL, Nanotron, litgpt, or torchtitan for larger runs
- Data: Dolma, RedPajama, StarCoderData, DCLM for pre-training; OpenHermes, Magpie, UltraChat for SFT
- Alignment: TRL’s DPOTrainer, Open-Instruct, or TRL’s GRPOTrainer
- Evaluation: lm-evaluation-harness, MMLU, GPQA, LiveBench, Chatbot Arena
- Serving: vLLM, SGLang, TensorRT-LLM
Part 19 - Cost & Economics
Let’s talk money. Building and running LLMs is expensive, and the economics drive many of the technical decisions.
The Cost Stack
| Cost Category | Frontier Lab (Annual) | Mid-size Lab | Startup / Indie |
|---|---|---|---|
| Pre-training compute | $100M – $1B+ | $5M – $50M | $0 (use open weights) |
| GPU cluster (owned) | 100,000+ H100/B200 | 1,000 – 10,000 | 1 – 8 consumer GPUs |
| GPU cluster (cloud) | $200M – $1B+/yr | $10M – $100M/yr | $1K – $50K/yr |
| Data licensing | $50M – $500M | $1M – $20M | $0 (open data) |
| Human labelers (SFT/RLHF) | $50M – $200M | $500K – $10M | $0 (synthetic) |
| Research team | 1,000+ PhDs | 50 – 500 | 1 – 10 |
| Inference serving | $500M – $2B+/yr | $5M – $50M/yr | $100 – $50K/mo |
The Economics Shape the Tech
Many technical decisions are driven by cost:
- MoE exists because inference cost matters more than training cost at scale
- Quantization is driven by serving economics: every GB of memory saved is money
- Distillation lets you serve a 7B model that behaves like a 70B
- Speculative decoding cuts GPU hours, which cuts dollars
- Open weights (Llama, Mistral, Qwen) exist because the training cost is so high that only a few players can afford it, but everyone can benefit from the outputs
Part 20 - Where the Weights “Store” Different Kinds of Knowledge
Recent work in mechanistic interpretability has mapped what different parts of the network actually do (the layer ranges below assume an 80-layer model):
- Early layers (1–10): Surface-level features such as parts of speech, simple syntax, local context
- Middle layers (10–50): The “knowledge layers.” Fact recall, entity disambiguation, induction heads. This is where most factual knowledge lives as sparse features.
- Late layers (50–80): Task-specific computation, formatting, style. The “last mile” that converts internal representations into next-token distributions.
- Attention heads: Some specialize in previous-token (local copying), some in induction (completing patterns), some in previous noun (coreference), some in doc-start (attention to context beginning).
Models represent far more concepts than they have neurons by encoding them as directions in activation space rather than as individual neurons, a phenomenon known as superposition. A 7B model may encode millions of concepts in its 4096-dimensional residual stream by using near-orthogonal directions. This is why “finding the neuron for X” usually fails: X is a direction, not a neuron.
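The geometric fact behind this is that random directions in a high-dimensional space are almost orthogonal to each other. The sketch below illustrates that with plain Python. It uses random unit vectors as a stand-in for feature directions, which is an assumption for illustration only: learned features are not random, and this says nothing about how a particular model arranges them.

```python
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list:
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors: list) -> float:
    """Largest |cosine similarity| over all distinct pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 4096              # width used in the text for a 7B model
vectors = [random_unit_vector(dim, rng) for _ in range(40)]
worst = max_abs_cosine(vectors)
print(f"max |cos| across {len(vectors)} random directions: {worst:.3f}")
```

The typical cosine between two random directions is on the order of 1/sqrt(dim), about 0.016 here, so interference between any two of them is small. That slack is what lets many more than 4096 directions coexist with only a little overlap.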
Part 21: The Frontier and What’s Next
As of mid-2026, the field is moving fast on several fronts:
- Test-time compute scaling: Models such as o3 and DeepSeek-R1, and modes such as Claude 4’s extended thinking, spend more inference FLOPs on harder problems. The weights stay fixed at inference time; only the amount of computation per query changes.
- Mixture of Experts: Now the default for frontier models. Llama 4, Grok-3, and others use MoE to get 70B+ quality at 10-20B active-parameter inference cost.
- State-space models: Mamba-2 and hybrids such as Jamba offer inference that scales linearly with sequence length. Still catching up on quality but improving fast.
- Continuous training: Models that never stop learning, updating weights on a rolling stream of new data without forgetting. The holy grail of lifelong AI.
- Weight merging: Taking two fine-tunes of the same base model and averaging their weights to combine capabilities. Surprisingly effective.
- Agentic capabilities: Tool use, planning, memory, multi-step reasoning. Models that can browse the web, write and execute code, and coordinate with other models.
- Long context: 1M+ token contexts are now standard. 10M+ is being researched. This relies on techniques such as ring attention (spreading attention over many devices) and YaRN (extending positional encodings beyond the trained length).
- Process reward models: Instead of just judging the final answer, judging each step of reasoning. Enables better math and coding.
- Multimodal unification: Single models that handle text, images, audio, video, and actions in one conversation. GPT-4o, Gemini 2.5, and Claude 4 are all heading this way.
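Of the items above, weight merging is the simplest to show in code. The sketch below linearly interpolates two checkpoints that share parameter names and shapes. It is a toy version under stated assumptions: the checkpoints are plain dicts of flat float lists, and the names `merge_weights`, `math_tune`, and `code_tune` are hypothetical. A real merge would operate on framework tensors and often uses more elaborate schemes than a plain average.

```python
def merge_weights(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * A + (1 - alpha) * B for every parameter.

    Both checkpoints must be fine-tunes of the same base model, so they
    share parameter names and shapes.
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints have different parameter names")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
    return merged


# Two hypothetical fine-tunes of the same tiny base model.
math_tune = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
code_tune = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_weights(math_tune, code_tune)
print(merged)
```

With `alpha=0.5` this is a straight average. Moving `alpha` toward 1.0 weights the result toward the first checkpoint, which is a cheap knob for trading off the two sets of capabilities.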
Part 22: The Full Lifecycle, From Idea to Production
Let’s zoom out and see how all these pieces fit together in a real project.
| Phase | Activities | Key Decisions | Deliverables |
|---|---|---|---|
| 1. Research | Literature review, architecture exploration, small-scale experiments | What problem are we solving? What architecture? What scale? | Research plan, architecture spec, proxy run results |
| 2. Data | Collection, cleaning, deduplication, quality filtering, mixing | What data sources? What mix? How much synthetic? | Curated training datasets, data cards, contamination reports |
| 3. Infrastructure | Cluster setup, networking, storage, monitoring, fault tolerance | How many GPUs? What interconnect? What storage? | Production training cluster, monitoring dashboards |
| 4. Pre-training | Large-scale training run, checkpointing, intervention | Learning rate schedule, batch size, when to stop | Base model checkpoint, training logs, loss curves |
| 5. Evaluation | Benchmark runs, contamination checks, human evals | Is the model good enough? What are its weaknesses? | Evaluation report, benchmark scores, failure analysis |
| 6. Post-training | SFT, RLHF/DPO, red-teaming, safety evals | What alignment strategy? What safety thresholds? | Aligned model checkpoint, safety report |
| 7. Optimization | Quantization, distillation, inference optimization | What latency/throughput targets? What quality tradeoffs? | Optimized model, serving config, benchmarks |
| 8. Deployment | Serving infrastructure, API design, monitoring, scaling | How to serve millions of requests? What SLA? | Production API, monitoring, incident response |
| 9. Operation | Monitoring, feedback collection, continuous improvement | When to retrain? What data to collect? | Feedback loops, usage analytics, improvement roadmap |
A frontier model goes through all nine phases over 12-24 months. Each phase has its own team, its own challenges, and its own failures. The weights you download at the end are just the final artifact of this enormous effort.
Summary
Building an LLM is not just about weights. It’s about:
- Tokenization: how you chop text into pieces
- Architecture: the blueprint of the factory
- Data: the raw material (and the real secret sauce)
- Scaling laws: the math of getting bigger
- Weights: the machines inside the factory
- Training recipe: the hyperparameters and optimizer
- Distributed systems: how you train across 16,000 GPUs
- Post-training: SFT, RLHF, alignment
- Safety: red-teaming, guardrails, interpretability
- Specialties: code, medical, personality, modalities
- Evaluation: measuring what matters
- Inference: making it fast and cheap to serve
- Economics: the cost reality that shapes everything
- Operations: monitoring, feedback, continuous improvement
The weights are the model, but they’re also the output of an enormous, multi-year engineering effort. Understand the weights, and you understand the intelligence. Understand everything else, and you understand how to build it.

