How Large Language Models Work: The Complete Technical Guide to Transformers, Training, and Inference (2026)
By Dylan Boudro (https://x.com/StarmorphAI)
25 min read
TL;DR: This is a complete technical walkthrough of how large language models work — from the transformer architecture and self-attention to training dynamics, inference optimization, and the modern innovations that power GPT-4o, Claude, Llama 3, and every other frontier model. Every claim is backed by research papers. Whether you're preparing for ML interviews, building LLM-powered applications, or just want to deeply understand the technology reshaping software, this guide covers the full stack.
This guide synthesizes content from two deep-dive references I created for my own learning: a technical interview prep document and an interactive visual guide. I'm combining them here into a single authoritative resource with thorough sourcing.
Table of Contents
- The Transformer Architecture
- Tokenization
- The Attention Mechanism
- Training: How LLMs Learn
- Key Hyperparameters
- Inference and Decoding
- The KV Cache
- Inference Acceleration
- Fine-Tuning and Alignment
- Scaling Laws
- Modern Innovations: The 2025-2026 LLM Stack
- Putting It All Together
- Sources
The Transformer Architecture
Before 2017, NLP relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. These processed text sequentially — one token at a time, left to right — which meant they were slow (no parallelization) and struggled with long-range dependencies. By the time an RNN reached the end of a long paragraph, it had largely "forgotten" the beginning.
The transformer (Vaswani et al., 2017 — "Attention Is All You Need") solved both problems. Instead of processing tokens sequentially, transformers process the entire input in parallel using self-attention. Every token can directly attend to every other token, regardless of distance.
This single change unlocked massive parallelization on GPUs, direct long-range context modeling, transfer learning at scale, and predictable performance scaling with more data and compute.
Decoder-Only: The Modern LLM
The original transformer had an encoder-decoder structure for machine translation. Modern LLMs — GPT-4o, Claude, Llama, Gemini, Mistral — all use only the decoder half with causal masking. No encoder, no cross-attention.
Each decoder block contains:
- RMSNorm (pre-normalization) — stabilizes activations before each sub-layer
- Masked Multi-Head Attention with GQA and RoPE — contextual mixing between tokens
- Residual connection — skip connection around the attention sub-layer
- RMSNorm — pre-normalization before the FFN
- Feed-Forward Network with SwiGLU — position-wise transformation that stores factual knowledge
- Residual connection — skip connection around the FFN
A modern LLM stacks 32-126 of these blocks. Llama 3 8B has 32 layers, Llama 3 70B has 80, and Llama 3 405B has 126 (Grattafiori et al., 2024).
Residual Connections
Skip connections around every sub-layer solve three problems:
- Gradient flow — gradients can flow directly backward through the network, enabling training of 100+ layer models without vanishing gradients (He et al., 2015)
- Residual learning — each layer learns a delta rather than a complete new representation
- Identity path — earlier layer outputs can pass through unchanged if a later layer's weights are near-zero
The formula in a modern pre-norm transformer:
output = x + SubLayer(Norm(x))
Pre-norm (normalizing before the sub-layer) produces more stable training dynamics than post-norm, especially for deep models. Nearly all modern LLMs use pre-norm (Xiong et al., 2020).
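The pre-norm residual pattern is easy to verify numerically. A minimal NumPy sketch, with `sub_layer` standing in for attention or the FFN; the final assert checks the identity-path property described above:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize by the root mean square over the feature dimension
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sub_layer):
    # output = x + SubLayer(Norm(x)): the residual path carries x unchanged
    return x + sub_layer(rms_norm(x))

x = np.random.default_rng(0).normal(size=(4, 8))   # (seq_len, d_model)
zero_layer = lambda h: np.zeros_like(h)            # a sub-layer with zeroed weights
assert np.allclose(pre_norm_block(x, zero_layer), x)   # identity path preserved
```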
Tokenization
LLMs don't process text directly — they operate on sequences of integer token IDs from a fixed vocabulary. All modern LLMs use subword tokenization, which balances three competing concerns:
| Level | Vocabulary Size | Sequence Length | Handles Unknown Words? |
|---|---|---|---|
| Character | ~256 | Very long | Yes |
| Subword | 32K–128K | Moderate | Yes |
| Word | 100K+ | Short | No |
Byte Pair Encoding (BPE)
BPE (Sennrich et al., 2016) is the most widely used tokenization algorithm. It starts with individual characters and iteratively merges the most frequent adjacent pair into a new token until the desired vocabulary size is reached.
Used by: GPT-2, GPT-3, GPT-4, Llama (via SentencePiece)
The key property is determinism — BPE saves the merge rules, enabling consistent encoding of new text at inference time.
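The merge loop can be illustrated with a toy example. This is a sketch over a single repeated string, not a corpus-level tokenizer with word frequencies:

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE: start from characters and greedily merge the most frequent
    adjacent symbol pair. Returns the final symbols and the learned merge
    rules (which real BPE saves for deterministic re-encoding)."""
    symbols = list(word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged symbol
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, merges

tokens, merges = bpe_merges("lowlowlowest", 2)
assert tokens == ['low', 'low', 'low', 'e', 's', 't']
assert merges == [('l', 'o'), ('lo', 'w')]
```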
SentencePiece
SentencePiece (Kudo & Richardson, 2018) treats input as raw bytes with no language-specific pre-tokenization. It can use either BPE or Unigram as the underlying algorithm. This makes it language-agnostic — critical for multilingual models.
Used by: Llama, T5, mBART
Vocabulary Size Matters
| Model | Vocab Size | Tokens/Param Ratio Impact |
|---|---|---|
| GPT-2 | 50,257 | Baseline |
| GPT-4 | ~100K | Better multilingual coverage |
| Llama 2 | 32K | Optimal for 7B params |
| Llama 3 | 128K | 4x increase — crucial for code and multilingual |
Llama 3's jump from 32K to 128K vocabulary was one of its most impactful changes. Larger vocabulary means fewer tokens per input, which means more content fits in the context window. The embedding table grows, but this is a small fraction of total parameters for a 70B+ model (Grattafiori et al., 2024).
Optimal vocabulary size scales with model size — research shows Llama 2's 32K vocabulary was optimal for 7B parameters, but the 70B model would have benefited from ~216K tokens (Tao et al., 2024).
The Attention Mechanism
Attention is the core innovation of the transformer. It lets every token "look at" every other token and decide which are relevant.
Scaled Dot-Product Attention
The formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V
Step by step:
- Project input X through learned weight matrices to get Q (queries), K (keys), V (values)
- Compute scores — dot product of Q and K gives raw relevance scores
- Scale — divide by sqrt(d_k) to prevent softmax saturation
- Mask — set future positions to -infinity (causal masking for autoregressive generation)
- Softmax — convert scores to a probability distribution (rows sum to 1)
- Weighted sum — multiply attention weights by V to produce context-enriched representations
Why scale by sqrt(d_k)? When d_k is large (e.g., 128), dot products grow in magnitude — their variance scales proportionally with d_k. Large values push softmax into near-one-hot distributions with vanishing gradients. Scaling keeps variance at ~1, maintaining healthy gradient flow (Vaswani et al., 2017).
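The six steps above translate almost directly into NumPy. An illustrative single-head sketch, not an optimized implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) raw relevance scores
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # mask future positions
    weights = softmax(scores)                     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))              # toy single head: n=4, d_k=8
out, weights = scaled_dot_product_attention(Q, K, V)
assert np.allclose(weights.sum(axis=-1), 1.0)     # rows are probability distributions
assert np.allclose(np.triu(weights, k=1), 0.0)    # no attention to the future
```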
Multi-Head Attention
Rather than one attention computation, the model runs h parallel attention heads, each with independently learned Q, K, V projections operating on a subspace of dimension d_head = d_model / h:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W^O
where head_i = Attention(X × W_i^Q, X × W_i^K, X × W_i^V)
Each head can learn different relationship types — one for syntactic dependencies, another for semantic similarity, another for coreference chains. The original transformer used 8 heads; modern LLMs use 32-128. The total parameter cost equals single-head attention with full dimension.
Causal Masking
In decoder-only LLMs, tokens can only attend to previous tokens (and themselves). This is enforced by a lower-triangular mask:
Mask = [[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0],
[1, 1, 1, 1]]
Positions where mask = 0 are set to -infinity before softmax, making their attention weight effectively zero. This ensures each position can only use past information, matching the autoregressive generation pattern at inference time.
Computational Complexity
| Operation | Time | Space |
|---|---|---|
| QK^T (attention scores) | O(n²d) | O(n²) |
| Softmax | O(n²) | O(n²) |
| Attention × V | O(n²d) | O(nd) |
| Total | O(n²d) | O(n²) |
The O(n²) scaling with sequence length is the fundamental bottleneck. For 8K context, the attention matrix has 64M entries per head. For 128K, it's 16B entries. This drives the need for Flash Attention and other optimizations.
Training: How LLMs Learn
Pretraining: Next-Token Prediction
LLMs are pretrained on massive text corpora using a deceptively simple objective: predict the next token given all previous tokens (causal language modeling).
L = -1/N × Σ log P(t_i | t_1, ..., t_{i-1})
This cross-entropy loss measures how "surprised" the model is by the actual next token. Lower loss = better predictions. Perplexity is the exponentiated loss: PPL = exp(L). A perplexity of 10 means the model is as uncertain as if choosing uniformly among 10 tokens.
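The loss-to-perplexity relationship can be checked in a few lines. A toy calculation, assuming we already have the probabilities the model assigned to the true tokens:

```python
import math

def cross_entropy_and_perplexity(true_token_probs):
    """Average negative log-likelihood of the actual next tokens, and its
    exponential (perplexity). Inputs are the probabilities P(t_i | t_1..t_{i-1})
    that the model assigned to each true token."""
    loss = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return loss, math.exp(loss)

# Assigning 0.1 to every true token is as uncertain as a uniform
# choice among 10 options: perplexity 10.
loss, ppl = cross_entropy_and_perplexity([0.1] * 5)
assert abs(ppl - 10.0) < 1e-9
```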
The entire sequence is processed in parallel during training, with causal masking ensuring each position only sees past tokens — matching inference behavior.
Optimization
AdamW (Loshchilov & Hutter, 2019) is the standard optimizer for LLM training. It provides adaptive per-parameter learning rates (via moving averages of gradients and squared gradients), momentum for smoothing noisy gradients, and decoupled weight decay for regularization.
Training typically uses:
- Learning rate warmup — gradual increase over 1-10% of total steps, allowing Adam's running averages to stabilize before large updates
- Cosine decay — learning rate gradually decreases to ~10% of peak after warmup
- Gradient clipping — max norm of 1.0 to prevent training instability
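A typical warmup-plus-cosine schedule can be sketched as follows. The peak LR and warmup length are illustrative defaults from the ranges above, not values from any particular run:

```python
import math

def lr_schedule(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

assert lr_schedule(1999, 100_000) == 3e-4                 # warmup ends at the peak
assert abs(lr_schedule(100_000, 100_000) - 3e-5) < 1e-12  # decays to ~10% of peak
```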
Mixed Precision Training
Modern training uses BF16 (bfloat16) for forward and backward passes while maintaining FP32 master weights (Micikevicius et al., 2018):
- Store master weights in FP32
- Cast to BF16 for forward pass
- Compute loss in FP32 (numerical stability)
- Backward pass in BF16
- Update master weights in FP32
BF16 is preferred over FP16 because it has the same exponent range as FP32 (preventing overflow). This yields ~2x memory reduction for activations and ~2x faster matrix multiplication on Tensor Cores.
Distributed Training
Training models with hundreds of billions of parameters requires distributing computation across thousands of GPUs. The main parallelism strategies:
Data Parallelism — Each GPU holds a complete model copy, processes different minibatches, and gradients are averaged via all-reduce. Simple but limited by per-GPU memory.
Tensor Parallelism (TP) — Individual weight matrices are split across GPUs. Requires high-bandwidth NVLink interconnect. Typically used within a single node (8 GPUs). (Shoeybi et al., 2020)
Pipeline Parallelism (PP) — Different layers assigned to different GPUs. Micro-batching fills the pipeline to minimize idle time. Can span across nodes.
Fully Sharded Data Parallelism (FSDP/ZeRO) — Shards model parameters, gradients, and optimizer states across all GPUs. Parameters are gathered on-demand for computation, then resharded. (Rajbhandari et al., 2020)
Llama 3 405B was trained with 8-way tensor parallelism within each node, 16-way pipeline parallelism across nodes, and data parallelism across node groups (Grattafiori et al., 2024).
Key Hyperparameters
Architecture Parameters
| Parameter | Description | 7-8B | 70B | 405B |
|---|---|---|---|---|
| Layers (L) | Transformer blocks | 32 | 80 | 126 |
| Hidden dim (d_model) | Residual stream width | 4096 | 8192 | 16384 |
| Attention heads (H) | Query heads | 32 | 64 | 128 |
| KV heads | Key-value heads (GQA) | 8 | 8 | 8 |
| Head dim (d_head) | d_model / H | 128 | 128 | 128 |
| FFN dim | Inner FFN dimension | ~11K | ~22K | ~44K |
| Vocab size | Token vocabulary | 128K | 128K | 128K |
| Context length | Max sequence | 8K | 8K | 8K |
Values from Llama 3 (Grattafiori et al., 2024)
Key Relationships
- d_head = d_model / num_heads — head dimension is derived, not independent
- FFN dim ≈ 8/3 × d_model — for SwiGLU (vs 4x for vanilla ReLU)
- Total params ≈ 12 × L × d_model² — rough approximation for decoder-only models
- Training tokens ≈ 20 × params — Chinchilla compute-optimal (though modern models exceed this)
Training Parameters
| Parameter | Typical Value | Purpose |
|---|---|---|
| Peak learning rate | 1e-4 to 3e-4 | Step size for weight updates |
| Warmup | 2000-4000 steps | Stabilize optimizer statistics |
| LR schedule | Cosine decay | Gradual reduction to ~10% of peak |
| Batch size | Millions of tokens | Ramps up during training |
| Weight decay | 0.1 | L2 regularization via AdamW |
| Dropout | 0.0 | Not needed for large-scale pretraining |
| Gradient clipping | 1.0 | Prevent training instability |
Why zero dropout? Large models trained on massive datasets are in an under-fitting regime — there's so much data that overfitting isn't the primary concern. Dropout would reduce effective capacity. Weight decay alone provides sufficient regularization (Grattafiori et al., 2024).
Inference and Decoding
Autoregressive Generation
LLMs generate text one token at a time:
- Prefill phase — process the entire prompt through the model (compute-bound)
- Sample/select the next token from the output probability distribution
- Append the token to the sequence
- Decode phase — process the new token (memory-bandwidth-bound, uses KV cache)
- Repeat steps 2-4 until a stop condition (EOS token, max length)
Decoding Strategies
Greedy decoding — always pick the highest-probability token. Deterministic and fast but often repetitive and boring. Locally optimal choices don't guarantee globally optimal sequences.
Beam search — maintain k candidate sequences. Better for tasks with a "right answer" (translation, summarization) but still deterministic and tends toward generic outputs. Rarely used for open-ended generation.
Temperature scaling adjusts the "sharpness" of the probability distribution:
P(t_i) = exp(logit_i / T) / Σ exp(logit_j / T)
| Temperature | Effect |
|---|---|
| T → 0 | Equivalent to greedy (argmax) |
| T < 1 | Sharper, more deterministic |
| T = 1 | Original distribution |
| T > 1 | Flatter, more random/creative |
Top-p (nucleus) sampling (Holtzman et al., 2020) — sort tokens by probability, accumulate until the sum exceeds p, truncate, renormalize, and sample. This dynamically adjusts the candidate set: confident predictions yield a small set, uncertain predictions yield a larger one. Superior to fixed top-k because it adapts to the model's confidence.
Min-p sampling — scales the threshold proportional to the model's maximum probability: threshold = min_p × max_probability. Preserves coherence even at high temperatures.
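Temperature and top-p compose exactly as described: scale the logits, softmax, truncate to the nucleus, renormalize, sample. A minimal sketch (assumes temperature > 0; production samplers handle more edge cases):

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) truncation and sampling.
    Assumes temperature > 0 (T -> 0 corresponds to plain argmax)."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())            # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set with mass >= top_p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()           # renormalize the nucleus
    return int(rng.choice(keep, p=kept))

logits = np.array([4.0, 2.0, 0.5, -1.0])
token = sample_top_p(logits, temperature=0.7, top_p=0.9,
                     rng=np.random.default_rng(0))
assert token in (0, 1)    # the low-probability tail is never sampled
```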
Common Parameter Combinations
| Use Case | Temperature | Top-p |
|---|---|---|
| Code generation | 0.0–0.2 | — |
| Factual Q&A | 0.0–0.3 | 0.9 |
| Creative writing | 0.7–1.0 | 0.95 |
| Brainstorming | 1.0–1.2 | 0.95 |
The KV Cache
Why It Exists
During autoregressive generation, self-attention at each step needs Keys and Values from all previous tokens. Without caching, generating n tokens requires O(n²) total token computations — reprocessing the entire growing sequence at every step.
The KV cache stores previously computed K and V vectors:
Step 1: Process [1,2,3,4,5] → Store K₁..₅, V₁..₅
Step 2: Process [6] only → Compute K₆, V₆, attend using cached K₁..₆, V₁..₆
Step 3: Process [7] only → Compute K₇, V₇, attend using cached K₁..₇, V₁..₇
This reduces generation from O(n²) to O(n) token computations after the initial prefill.
Memory Cost
KV cache memory per token per layer:
memory = 2 × num_kv_heads × d_head × bytes_per_element
For Llama 3 8B in BF16: 32 layers × 8 KV heads × 128 d_head × 2 bytes × 2 (K+V) = 128 KB per token. At 8K context, that's 1 GB. At 128K context, 16 GB.
For Llama 3 70B: 80 layers × 8 KV heads × 128 d_head × 2 bytes × 2 = 320 KB per token. At 8K context, 2.5 GB.
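Both figures follow directly from the per-token formula. A quick sanity check:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    # Factor of 2 for storing both K and V at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Llama 3 8B in BF16 (2 bytes per element)
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
assert per_token == 128 * 1024                  # 128 KB per token
assert per_token * 8192 == 1024 ** 3            # 1 GB at 8K context

# Llama 3 70B
assert kv_cache_bytes_per_token(80, 8, 128) == 320 * 1024   # 320 KB per token
```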
The KV cache can consume up to 70% of total GPU memory during inference and grows linearly with sequence length and batch size.
Two Phases of Inference
- Prefill — processes the entire prompt at once. Compute-bound (matrix multiplications dominate).
- Decode — generates tokens one by one. Memory-bandwidth-bound (reading the KV cache from HBM dominates). The GPU spends most time reading gigabytes of cached data to do very little math.
This is why inference latency is dominated by the decode phase, and why memory bandwidth (not FLOPS) is the primary bottleneck for LLM serving.
KV Cache Compression
Several techniques reduce KV cache footprint:
- Quantization — compress KV entries from BF16 to INT8 or INT4. KVQuant achieves up to 10M token context via aggressive KV quantization (Hooper et al., 2024)
- Token eviction — keep only the most-attended tokens. H2O (Heavy Hitter Oracle) identifies and retains high-attention tokens (Zhang et al., 2023)
- Attention sinks — always retain the first few tokens, which models disproportionately attend to regardless of content (Xiao et al., 2024)
Inference Acceleration
Quantization
Reduces numerical precision of model weights to decrease memory footprint and increase throughput.
| Method | Target | Quality | Speed | Use Case |
|---|---|---|---|---|
| GPTQ | NVIDIA GPU | Good (~90%) | Fastest (w/ Marlin) | Production GPU serving |
| AWQ | NVIDIA GPU | Best (~95%) | Fast (w/ Marlin) | Quality-sensitive GPU serving |
| GGUF | CPU/Apple/GPU | Good (~92%) | Moderate | Local inference, edge devices |
| FP8 | H100/H200 | Excellent | Very fast | Native H100+ support |
GPTQ (Frantar et al., 2023) — uses second-order (Hessian) information to determine which weights are most sensitive to quantization. Minimizes output error layer by layer using a calibration dataset.
AWQ (Lin et al., 2024) — identifies the ~1% of weights that are disproportionately important based on activation patterns and protects them during quantization. Near-FP16 quality at INT4.
GGUF / llama.cpp — CPU-friendly quantization format supporting 1.5-bit to 8-bit precision. Q4_K_M is the sweet spot: ~70% smaller files, 90-95% quality retention. Runs on Apple Silicon (Metal), x86 (AVX2/AVX512), NVIDIA (CUDA), and AMD (HIP). See our local LLM inference guide for detailed comparisons.
Flash Attention
An IO-aware, exact attention algorithm that eliminates the memory bottleneck of standard attention (Dao et al., 2022).
The problem: standard attention materializes the full N×N attention matrix in HBM (slow global GPU memory). Each element is written then read back multiple times.
Flash Attention fixes this by:
- Tiling — splits Q, K, V into blocks that fit in SRAM (fast on-chip memory)
- Incremental softmax — computes softmax across blocks using running statistics (online softmax trick)
- No materialization — never writes the full N×N matrix to HBM
- Recomputation — in the backward pass, recomputes attention on-chip instead of reading it
The result: up to 7.6x faster than standard attention, with linear memory in sequence length (vs quadratic). Crucially, Flash Attention is exact — it computes mathematically identical results to standard attention. It's purely an IO optimization.
Flash Attention 2 (Dao, 2023) improved work partitioning across GPU threads, reducing synchronization overhead for another ~2x speedup.
Speculative Decoding
Uses a small "draft" model to predict multiple tokens, verified in parallel by the large "target" model (Leviathan et al., 2023):
- Draft model generates K candidates quickly (K = 5-8)
- Target model verifies all K in a single forward pass
- Accepted tokens match what the target would have generated
- On first mismatch, reject that token and all subsequent
- Target generates the correct token for the rejected position
Lossless — the verification step guarantees the output is statistically identical to the target model alone. Typical speedup: 2-3x. EAGLE achieves ~80% acceptance rate with up to 3.6x speedup (Li et al., 2024).
Continuous Batching and PagedAttention
Continuous batching (Yu et al., 2022, Orca) changes the batch composition at every decode step. When a request finishes, it's immediately replaced by a new one. This eliminates idle GPU cycles from static batching, achieving up to 36.9x higher throughput.
PagedAttention (Kwon et al., 2023, vLLM) manages KV cache memory like an OS manages virtual memory — fixed-size pages allocated dynamically, non-contiguous in physical memory. This eliminates KV cache fragmentation and enables memory sharing between requests with common prefixes (e.g., shared system prompts).
Both techniques are now standard in production serving frameworks like vLLM, TensorRT-LLM, and TGI.
Fine-Tuning and Alignment
The Three Stages
A raw pretrained transformer is a powerful text predictor but a poor assistant. Post-training transforms it into something useful:
Stage 1: Supervised Fine-Tuning (SFT) — train on curated (instruction, response) pairs. The model learns formatting, instruction following, and conversational patterns. Key work: FLAN (Wei et al., 2022), InstructGPT (Ouyang et al., 2022).
Stage 2: Preference Optimization — align the model with human preferences using either RLHF or DPO.
Stage 3: Safety Training — additional alignment for harmlessness. Anthropic's Constitutional AI (Bai et al., 2022) uses AI-generated feedback (RLAIF) based on predefined principles rather than relying entirely on human annotators.
RLHF vs DPO
RLHF (Reinforcement Learning from Human Feedback) is a three-phase process:
- Collect human preference data (rank multiple responses per prompt)
- Train a reward model to predict preferences
- Use PPO to optimize the LLM against the reward model, with a KL penalty to prevent divergence
DPO (Direct Preference Optimization) (Rafailov et al., 2023) eliminates the reward model entirely by directly optimizing the LLM using preference pairs in a single training stage. The key insight: the optimal RLHF policy can be derived in closed form as a function of the preference data.
DPO advantages: simpler pipeline (one stage vs three), more stable training (no PPO instability), faster convergence, lower compute cost, and empirically matches or exceeds RLHF quality. DPO has become the preferred alignment method as of 2025.
LoRA and QLoRA
LoRA (Low-Rank Adaptation) (Hu et al., 2021) exploits the fact that weight updates during fine-tuning occupy a low-rank subspace:
W' = W + ΔW = W + BA
where B is (d × r), A is (r × d), r << d
For d=4096, r=16: 131K trainable params vs 16.7M for full fine-tuning — a 128x reduction. B is initialized to zero so the model starts identical to the pretrained version. At inference, the adapter merges into the base weights with zero latency overhead.
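The parameter arithmetic is easy to verify. This counts a single adapted (d × d) matrix; in practice LoRA is applied to several projection matrices per layer:

```python
def lora_trainable_params(d, r):
    # B is (d x r), A is (r x d): d*r + r*d trainable parameters
    return d * r + r * d

d, r = 4096, 16
full_ft = d * d                        # full fine-tuning of one (d x d) matrix
lora = lora_trainable_params(d, r)
assert lora == 131_072                 # ~131K trainable parameters
assert full_ft // lora == 128          # the 128x reduction
```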
QLoRA (Dettmers et al., 2023) combines 4-bit NormalFloat quantization of base weights with BF16 LoRA adapters. This enables fine-tuning a 65B parameter model on a single 48GB GPU.
Catastrophic Forgetting
Fine-tuning on new data can destroy pretrained capabilities. Mitigation strategies:
- Mix general data with task-specific data during fine-tuning
- Use lower learning rates (1e-5 to 5e-5 vs 1e-4 for pretraining)
- LoRA (freeze base weights, though not fully immune)
- Elastic Weight Consolidation — penalize changes to important weights (Kirkpatrick et al., 2017)
Scaling Laws
Chinchilla: The Compute-Optimal Ratio
Hoffmann et al. (2022) trained 400+ models and discovered the compute-optimal relationship:
D_optimal ≈ 20 × N
For every doubling of model size, training tokens should also double. This proved GPT-3 (175B params, 300B tokens) was severely undertrained — Chinchilla-optimal would have been 3.5T tokens. Chinchilla (70B params, 1.4T tokens) outperformed the 4x larger Gopher by being compute-optimally trained.
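The ratio makes these numbers easy to check:

```python
def chinchilla_optimal_tokens(n_params):
    # D_optimal ≈ 20 * N (Hoffmann et al., 2022)
    return 20 * n_params

# GPT-3: 175B params but only 300B training tokens
assert chinchilla_optimal_tokens(175e9) == 3.5e12   # ~3.5T tokens would be optimal
# Chinchilla: 70B params, 1.4T tokens, exactly on the ratio
assert chinchilla_optimal_tokens(70e9) == 1.4e12
```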
Beyond Chinchilla: Inference-Optimal Training
Modern practice exceeds the Chinchilla ratio because inference cost matters more than training cost:
| Model | Params | Training Tokens | Tokens/Param |
|---|---|---|---|
| Chinchilla | 70B | 1.4T | 20x |
| Llama 2 | 70B | 2T | 29x |
| Llama 3 8B | 8B | 15T | 1,875x |
| Llama 3 70B | 70B | 15T | 214x |
A smaller model trained on more data can match a larger model's quality while being cheaper to serve. Training happens once; inference happens millions of times. This is sometimes called "inference-optimal" training — spend more on training to get a smaller, cheaper-to-deploy model.
Emergent Abilities
Certain capabilities appear to "emerge" suddenly at scale rather than improving gradually — few-shot arithmetic around 10B+ parameters, chain-of-thought reasoning around 100B+ (Wei et al., 2022).
Controversy: Schaeffer et al. (2023) argue emergence is an artifact of evaluation metrics. When measured with continuous metrics (token-level accuracy) instead of discontinuous ones (exact match), improvements are typically smooth. The practical takeaway: scaling reliably improves capabilities, but whether a specific benchmark shows a "jump" depends on the evaluation methodology.
Modern Innovations: The 2025-2026 LLM Stack
Nearly every frontier model uses this combination. These aren't optional upgrades — they're the standard architecture.
RoPE (Rotary Positional Embeddings)
Su et al. (2022) — used by Llama, Mistral, Qwen, Gemma, and nearly all modern LLMs.
Self-attention is permutation-invariant — it has no concept of token order. RoPE injects positional information by applying position-dependent rotations to Q and K vectors before computing attention. For position m, each pair of features is rotated by angle m × θ:
θ_i = base^(-2i/d) where base = 10000
The key mathematical property: the dot product of rotated Q at position m and K at position n depends only on the relative distance (m-n), naturally encoding relative position.
Advantages over sinusoidal/learned positional encodings:
- Relative position is directly in the attention scores
- No additional learnable parameters
- Better length generalization with extensions like YaRN (Peng et al., 2023)
- Compatible with Flash Attention
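The relative-position property can be verified numerically. A minimal RoPE sketch using the interleaved-pair convention (one common choice; implementations differ in how they pair features):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each consecutive feature pair of x by angle pos * theta_i,
    where theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 64))
# Same relative distance (7) at different absolute positions gives the same score:
score_a = rope_rotate(q, 10) @ rope_rotate(k, 3)
score_b = rope_rotate(q, 107) @ rope_rotate(k, 100)
assert np.isclose(score_a, score_b)
```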
GQA (Grouped Query Attention)
Ainslie et al. (2023) — used by Llama 3, Mistral, Gemma, Qwen.
Standard multi-head attention (MHA) gives every head its own K and V — expensive to store. Multi-Query Attention (MQA) uses a single K,V for all heads — too aggressive, degrades quality.
GQA is the middle ground: multiple query heads share the same K,V heads.
| Type | Q Heads | KV Heads | KV Cache Size |
|---|---|---|---|
| MHA | H | H | 1x |
| GQA | H | G (G < H) | G/H of MHA |
| MQA | H | 1 | 1/H of MHA |
Llama 3 uses 32 query heads and 8 KV heads — every 4 query heads share 1 KV head pair. This gives a 4x reduction in KV cache with minimal quality loss compared to full MHA.
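In practice GQA is often implemented by broadcasting the G cached KV heads up to the H query heads just before the attention matmul. A sketch of that expansion:

```python
import numpy as np

def expand_kv_heads(kv, num_query_heads):
    """Broadcast G shared KV heads up to H query heads.
    kv: (num_kv_heads, seq, d_head) -> (num_query_heads, seq, d_head)."""
    group_size = num_query_heads // kv.shape[0]    # query heads per KV head
    return np.repeat(kv, group_size, axis=0)

# Llama 3 8B shapes: 32 query heads share 8 KV heads (groups of 4)
k = np.random.default_rng(0).normal(size=(8, 16, 128))
k_expanded = expand_kv_heads(k, 32)
assert k_expanded.shape == (32, 16, 128)
# Query heads 0-3 all read the same cached KV head 0:
assert all(np.array_equal(k_expanded[i], k[0]) for i in range(4))
```

Only the G heads are ever stored in the KV cache; the expansion happens on the fly, which is where the 4x memory saving comes from.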
SwiGLU Activation
Shazeer (2020) — used by Llama, Mistral, PaLM, Gemma.
Replaces the standard ReLU FFN with a gated activation:
Standard: FFN(x) = ReLU(xW₁) × W₂ (2 weight matrices)
SwiGLU: FFN(x) = (Swish(xW₁) ⊙ xW₂) × W₃ (3 weight matrices)
The element-wise product (gating) lets the network selectively pass information — one branch modulates what the other passes through. This makes the FFN more expressive than a plain ReLU block. To maintain parameter parity despite the extra matrix, the inner dimension is reduced from 4×d_model to 8/3×d_model.
Empirically, SwiGLU produces better perplexity per FLOP than ReLU or GELU.
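A minimal SwiGLU FFN in NumPy. Weights are random for illustration; the zero-gate check shows how the element-wise product controls information flow:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))                  # Swish / SiLU: x * sigmoid(x)

def swiglu_ffn(x, W1, W2, W3):
    """FFN(x) = (Swish(x W1) ⊙ x W2) W3: one branch gates the other elementwise."""
    return (swish(x @ W1) * (x @ W2)) @ W3

d_model, d_ff = 8, 21                              # d_ff ≈ (8/3) * d_model
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, d_model, d_ff))
W3 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(4, d_model))
assert swiglu_ffn(x, W1, W2, W3).shape == (4, d_model)
# Zeroing one branch closes the gate: nothing passes through
assert np.allclose(swiglu_ffn(x, W1, np.zeros_like(W2), W3), 0.0)
```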
RMSNorm
Zhang & Sennrich (2019) — used by Llama, Mistral, Gemma, Qwen.
Simplifies Layer Normalization by dropping mean centering:
LayerNorm(x) = γ × (x - mean(x)) / sqrt(var(x) + ε) + β
RMSNorm(x) = γ × x / sqrt(mean(x²) + ε)
~15% faster, fewer parameters (no β), and empirically matches LayerNorm quality for transformers.
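Both norms are a few lines each, and for zero-mean inputs they coincide, since the RMS then equals the standard deviation:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma=1.0, eps=1e-6):
    # No mean subtraction and no beta: just divide by the root mean square
    return gamma * x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

x = np.random.default_rng(0).normal(size=(4, 8))
centered = x - x.mean(axis=-1, keepdims=True)
# For zero-mean inputs the two normalizations agree:
assert np.allclose(rms_norm(centered), layer_norm(centered))
```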
Mixture of Experts (MoE)
Used by: Mixtral (Jiang et al., 2024), DeepSeek, Llama 4 Scout
Replace the single FFN per block with multiple "expert" FFNs and a router:
Standard: x → FFN(x) → output
MoE: x → Router(x) → select top-k experts → Σ(expert_i(x)) → output
Mixtral 8x7B has 46.7B total parameters but only 12.9B active per token (top-2 routing). It matches Llama 2 70B quality with 3-4x less compute per token.
The tradeoff: MoE models require more total memory (all experts loaded) but use less compute per token, making them faster and cheaper at inference time.
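Top-k routing for a single token can be sketched in a few lines. Experts here are plain linear maps standing in for full FFNs, and real routers also add load-balancing losses during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_experts = 8, 4

# Experts as simple linear maps (stand-ins for full SwiGLU FFNs)
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(num_experts)]
router_W = rng.normal(size=(d, num_experts))

def moe_layer(x, router_W, experts, top_k=2):
    """Score all experts, run only the top-k, and sum their outputs
    weighted by a softmax over the selected scores."""
    scores = x @ router_W                          # (num_experts,) router logits
    top = np.argsort(scores)[-top_k:]              # indices of the top-k experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                                   # softmax over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

x = rng.normal(size=d)
out = moe_layer(x, router_W, experts)              # only 2 of 4 experts execute
assert out.shape == (d,)
```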
Putting It All Together
Here's the complete pipeline — from text to prediction — in a modern LLM:
- Tokenize — BPE/SentencePiece splits text into subword tokens mapped to integer IDs
- Embed — look up each token ID in the embedding matrix (vocab_size × d_model)
- Apply RoPE — encode position information via rotation of Q, K vectors
- Process through N decoder blocks, each containing:
- RMSNorm → GQA attention (with causal mask and KV cache) → residual
- RMSNorm → SwiGLU FFN → residual
- Final RMSNorm → linear projection to vocabulary → softmax
- Decode — sample next token via temperature + top-p, append to KV cache, repeat
At training time, the full sequence is processed in parallel with causal masking. At inference time, the prefill phase processes the prompt in one pass, then the decode phase generates tokens one at a time using the KV cache.
Quick Reference: Key Numbers
| Metric | Value |
|---|---|
| Transformer paper | Vaswani et al., 2017 |
| Chinchilla optimal ratio | 20 tokens per parameter |
| Llama 3 8B training data | 15T tokens (1,875x Chinchilla) |
| Flash Attention speedup | Up to 7.6x over standard |
| LoRA parameter reduction | ~100x fewer trainable params |
| QLoRA memory savings | Fine-tune 65B on single 48GB GPU |
| Speculative decoding speedup | 2-3x typical, up to 3.6x |
| Continuous batching improvement | Up to 36.9x throughput |
| KV cache share of GPU memory | Up to 70% during inference |
| GQA KV cache reduction (Llama 3) | 4x (32Q/8KV) |
| Attention complexity | O(n²d) time, O(n²) space |
| RMSNorm speedup over LayerNorm | ~15% |
| Mixtral active vs total params | 12.9B / 46.7B |
Sources
Research Papers
- Attention Is All You Need — Vaswani et al. (2017)
- The Llama 3 Herd of Models — Grattafiori et al. (2024)
- Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al. (2022)
- FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al. (2022)
- FlashAttention-2: Faster Attention with Better Parallelism — Dao (2023)
- Direct Preference Optimization (DPO) — Rafailov et al. (2023)
- LoRA: Low-Rank Adaptation of Large Language Models — Hu et al. (2021)
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. (2023)
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT/RLHF) — Ouyang et al. (2022)
- Constitutional AI: Harmlessness from AI Feedback — Bai et al. (2022)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) — Su et al. (2022)
- GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al. (2023)
- GLU Variants Improve Transformer (SwiGLU) — Shazeer (2020)
- Root Mean Square Layer Normalization (RMSNorm) — Zhang & Sennrich (2019)
- Mixtral of Experts — Jiang et al. (2024)
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM) — Kwon et al. (2023)
- Orca: A Distributed Serving System for Transformer-Based Models — Yu et al. (2022)
- Fast Inference from Transformers via Speculative Decoding — Leviathan et al. (2023)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al. (2024)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers — Frantar et al. (2023)
- AWQ: Activation-Aware Weight Quantization — Lin et al. (2024)
- KVQuant: Towards 10M Context Length via KV Cache Quantization — Hooper et al. (2024)
- H2O: Heavy-Hitter Oracle for Efficient KV Cache — Zhang et al. (2023)
- Efficient Streaming Language Models with Attention Sinks — Xiao et al. (2024)
- The Curious Case of Neural Text Degeneration (Nucleus Sampling) — Holtzman et al. (2020)
- Neural Machine Translation of Rare Words with Subword Units (BPE) — Sennrich et al. (2016)
- SentencePiece: A Simple and Language Independent Tokenizer — Kudo & Richardson (2018)
- Deep Residual Learning for Image Recognition — He et al. (2015)
- On Layer Normalization in the Transformer Architecture (Pre-Norm) — Xiong et al. (2020)
- Decoupled Weight Decay Regularization (AdamW) — Loshchilov & Hutter (2019)
- Mixed Precision Training — Micikevicius et al. (2018)
- Megatron-LM: Training Multi-Billion Parameter Language Models (Tensor Parallelism) — Shoeybi et al. (2020)
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Rajbhandari et al. (2020)
- Emergent Abilities of Large Language Models — Wei et al. (2022)
- Are Emergent Abilities of Large Language Models a Mirage? — Schaeffer et al. (2023)
- YaRN: Efficient Context Window Extension of Large Language Models — Peng et al. (2023)
- Finetuned Language Models Are Zero-Shot Learners (FLAN) — Wei et al. (2022)
- Overcoming Catastrophic Forgetting (EWC) — Kirkpatrick et al. (2017)
- Scaling Laws for Vocabulary Size in LLMs — Tao et al. (2024)