
How Large Language Models Work: The Complete Technical Guide to Transformers, Training, and Inference (2026)

TL;DR: This is a complete technical walkthrough of how large language models work — from the transformer architecture and self-attention to training dynamics, inference optimization, and the modern innovations that power GPT-4o, Claude, Llama 3, and every other frontier model. Every claim is backed by research papers. Whether you're preparing for ML interviews, building LLM-powered applications, or just want to deeply understand the technology reshaping software, this guide covers the full stack.

This guide synthesizes content from two deep-dive references I created for my own learning: a technical interview prep document and an interactive visual guide. I'm combining them here into a single authoritative resource with thorough sourcing.

> How LLMs Work: Premium Report
Get the 24-page PDF with 5 exclusive sections — Model Comparison Matrix, Parameter Calculation Worksheets, ML Interview Cheat Sheet, Annotated Paper Reading List, and 50+ term Glossary.
[Get the Premium Report — $19]

The Transformer Architecture

Before 2017, NLP relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. These processed text sequentially — one token at a time, left to right — which meant they were slow (no parallelization) and struggled with long-range dependencies. By the time an RNN reached the end of a long paragraph, it had largely "forgotten" the beginning.

The transformer (Vaswani et al., 2017 — "Attention Is All You Need") solved both problems. Instead of processing tokens sequentially, transformers process the entire input in parallel using self-attention. Every token can directly attend to every other token, regardless of distance.

This single change unlocked massive parallelization on GPUs, direct long-range context modeling, transfer learning at scale, and predictable performance scaling with more data and compute.

Decoder-Only: The Modern LLM

The original transformer had an encoder-decoder structure for machine translation. Modern LLMs — GPT-4o, Claude, Llama, Gemini, Mistral — all use only the decoder half with causal masking. No encoder, no cross-attention.

Each decoder block contains:

  1. RMSNorm (pre-normalization) — stabilizes activations before each sub-layer
  2. Masked Multi-Head Attention with GQA and RoPE — contextual mixing between tokens
  3. Residual connection — skip connection around the attention sub-layer
  4. RMSNorm — pre-normalization before the FFN
  5. Feed-Forward Network with SwiGLU — position-wise transformation that stores factual knowledge
  6. Residual connection — skip connection around the FFN

A modern LLM stacks 32-126 of these blocks. Llama 3 8B has 32 layers, Llama 3 70B has 80, and Llama 3 405B has 126 (Grattafiori et al., 2024).

Residual Connections

Skip connections around every sub-layer solve three problems:

  1. Gradient flow — gradients can flow directly backward through the network, enabling training of 100+ layer models without vanishing gradients (He et al., 2015)
  2. Residual learning — each layer learns a delta rather than a complete new representation
  3. Identity path — earlier layer outputs can pass through unchanged if a later layer's weights are near-zero

The formula in a modern pre-norm transformer:

output = x + SubLayer(Norm(x))

Pre-norm (normalizing before the sub-layer) produces more stable training dynamics than post-norm, especially for deep models. Nearly all modern LLMs use pre-norm (Xiong et al., 2020).

Tokenization

LLMs don't process text directly — they operate on sequences of integer token IDs from a fixed vocabulary. All modern LLMs use subword tokenization, which balances three competing concerns:

| Level | Vocabulary Size | Sequence Length | Handles Unknown Words? |
|---|---|---|---|
| Character | ~256 | Very long | Yes |
| Subword | 32K–128K | Moderate | Yes |
| Word | 100K+ | Short | No |

Byte Pair Encoding (BPE)

BPE (Sennrich et al., 2016) is the most widely used tokenization algorithm. It starts with individual characters and iteratively merges the most frequent adjacent pair into a new token until the desired vocabulary size is reached.

Used by: GPT-2, GPT-3, GPT-4, Llama (via SentencePiece)

The key property is determinism — BPE saves the merge rules, enabling consistent encoding of new text at inference time.
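
The merge procedure is easy to see in a toy trainer. A minimal illustrative sketch (real BPE implementations pre-tokenize on whitespace, add end-of-word markers, and operate on byte-level alphabets — all omitted here):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: returns the learned merge rules in order."""
    # Represent each word as a tuple of symbols (single characters to start).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = bpe_train(["low", "lower", "lowest", "low"], num_merges=2)
print(merges)  # → [('l', 'o'), ('lo', 'w')]
```

Saving `merges` in order is exactly what makes BPE deterministic: new text is encoded by replaying the same rules.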

SentencePiece

SentencePiece (Kudo & Richardson, 2018) treats the input as a raw character stream — whitespace included — with no language-specific pre-tokenization, which makes it language-agnostic, critical for multilingual models. It can use either BPE or Unigram as the underlying algorithm.

Used by: Llama, T5, mBART

Vocabulary Size Matters

| Model | Vocab Size | Impact |
|---|---|---|
| GPT-2 | 50,257 | Baseline |
| GPT-4 | ~100K | Better multilingual coverage |
| Llama 2 | 32K | Optimal for 7B params |
| Llama 3 | 128K | 4x increase — crucial for code and multilingual |

Llama 3's jump from 32K to 128K vocabulary was one of its most impactful changes. Larger vocabulary means fewer tokens per input, which means more content fits in the context window. The embedding table grows, but this is a small fraction of total parameters for a 70B+ model (Grattafiori et al., 2024).

Optimal vocabulary size scales with model size — research shows Llama 2's 32K vocabulary was optimal for 7B parameters, but the 70B model would have benefited from ~216K tokens (Tao et al., 2024).

The Attention Mechanism

Attention is the core innovation of the transformer. It lets every token "look at" every other token and decide which are relevant.

Scaled Dot-Product Attention

The formula:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V

Step by step:

  1. Project input X through learned weight matrices to get Q (queries), K (keys), V (values)
  2. Compute scores — dot product of Q and K gives raw relevance scores
  3. Scale — divide by sqrt(d_k) to prevent softmax saturation
  4. Mask — set future positions to -infinity (causal masking for autoregressive generation)
  5. Softmax — convert scores to a probability distribution (rows sum to 1)
  6. Weighted sum — multiply attention weights by V to produce context-enriched representations

Why scale by sqrt(d_k)? When d_k is large (e.g., 128), dot products grow in magnitude — their variance scales proportionally with d_k. Large values push softmax into near-one-hot distributions with vanishing gradients. Scaling keeps variance at ~1, maintaining healthy gradient flow (Vaswani et al., 2017).
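
The six steps above fit in a few lines of NumPy. A minimal single-head sketch (the learned Q/K/V projections are skipped — random matrices stand in for projected inputs):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (steps 2-6 above)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)           # raw relevance, scaled by sqrt(d_k)
    mask = np.triu(np.ones((n, n)), k=1)      # 1s above the diagonal = future positions
    scores = np.where(mask == 1, -np.inf, scores)
    # Numerically stable softmax: -inf entries become exactly 0 weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights               # context-enriched outputs, attention map

rng = np.random.default_rng(0)
n, d = 4, 8
out, w = causal_attention(rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)))
print(np.allclose(w.sum(axis=-1), 1.0))   # True: each row is a distribution
print(np.allclose(np.triu(w, k=1), 0.0))  # True: no attention to future tokens
```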

Multi-Head Attention

Rather than one attention computation, the model runs h parallel attention heads, each with independently learned Q, K, V projections operating on a subspace of dimension d_head = d_model / h:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W^O
where head_i = Attention(X × W_i^Q, X × W_i^K, X × W_i^V)

Each head can learn different relationship types — one for syntactic dependencies, another for semantic similarity, another for coreference chains. The original transformer used 8 heads; modern LLMs use 32-128. The total parameter cost equals single-head attention with full dimension.

Causal Masking

In decoder-only LLMs, tokens can only attend to previous tokens (and themselves). This is enforced by a lower-triangular mask:

Mask = [[1, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]]

Positions where mask = 0 are set to -infinity before softmax, making their attention weight effectively zero. This ensures each position can only use past information, matching the autoregressive generation pattern at inference time.

Computational Complexity

| Operation | Time | Space |
|---|---|---|
| QK^T (attention scores) | O(n²d) | O(n²) |
| Softmax | O(n²) | O(n²) |
| Attention × V | O(n²d) | O(nd) |
| Total | O(n²d) | O(n²) |

The O(n²) scaling with sequence length is the fundamental bottleneck. For 8K context, the attention matrix has 64M entries per head. For 128K, it's 16B entries. This drives the need for Flash Attention and other optimizations.

Training: How LLMs Learn

Pretraining: Next-Token Prediction

LLMs are pretrained on massive text corpora using a deceptively simple objective: predict the next token given all previous tokens (causal language modeling).

L = -1/N × Σ log P(t_i | t_1, ..., t_{i-1})

This cross-entropy loss measures how "surprised" the model is by the actual next token. Lower loss = better predictions. Perplexity is the exponentiated loss: PPL = exp(L). A perplexity of 10 means the model is as uncertain as if choosing uniformly among 10 tokens.

The entire sequence is processed in parallel during training, with causal masking ensuring each position only sees past tokens — matching inference behavior.
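
The loss-perplexity relationship above is easy to check numerically. A toy example with hypothetical per-token probabilities the model assigned to the actual next tokens:

```python
import math

# Hypothetical probabilities assigned to each true next token.
probs = [0.5, 0.25, 0.125, 0.125]

# Cross-entropy loss: average negative log-probability of the true tokens.
loss = -sum(math.log(p) for p in probs) / len(probs)

# Perplexity is the exponentiated loss: PPL = exp(L).
ppl = math.exp(loss)
print(round(loss, 4), round(ppl, 4))  # → 1.5596 4.7568
```

If every token instead had probability 1/10, the loss would be ln 10 and the perplexity exactly 10 — the "uniform among 10 tokens" intuition above.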

Optimization

AdamW (Loshchilov & Hutter, 2019) is the standard optimizer for LLM training. It provides adaptive per-parameter learning rates (via moving averages of gradients and squared gradients), momentum for smoothing noisy gradients, and decoupled weight decay for regularization.

Training typically uses:

  • Learning rate warmup — gradual increase over 1-10% of total steps, allowing Adam's running averages to stabilize before large updates
  • Cosine decay — learning rate gradually decreases to ~10% of peak after warmup
  • Gradient clipping — max norm of 1.0 to prevent training instability
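
The warmup-then-cosine shape can be sketched in a few lines. The peak LR, warmup length, and 10% floor below are illustrative defaults, not any particular model's published values:

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # linear ramp
    # Progress through the decay phase, in [0, 1].
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * t))
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

total = 100_000
print(lr_at(0, total))       # tiny LR on the first step
print(lr_at(1999, total))    # peak at the end of warmup
print(lr_at(total, total))   # decayed to ~10% of peak
```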

Mixed Precision Training

Modern training uses BF16 (bfloat16) for forward and backward passes while maintaining FP32 master weights (Micikevicius et al., 2018):

  1. Store master weights in FP32
  2. Cast to BF16 for forward pass
  3. Compute loss in FP32 (numerical stability)
  4. Backward pass in BF16
  5. Update master weights in FP32

BF16 is preferred over FP16 because it has the same exponent range as FP32 (preventing overflow). This yields ~2x memory reduction for activations and ~2x faster matrix multiplication on Tensor Cores.

Distributed Training

Training models with hundreds of billions of parameters requires distributing computation across thousands of GPUs. The main parallelism strategies:

Data Parallelism — Each GPU holds a complete model copy, processes different minibatches, and gradients are averaged via all-reduce. Simple but limited by per-GPU memory.

Tensor Parallelism (TP) — Individual weight matrices are split across GPUs. Requires high-bandwidth NVLink interconnect. Typically used within a single node (8 GPUs). (Shoeybi et al., 2020)

Pipeline Parallelism (PP) — Different layers assigned to different GPUs. Micro-batching fills the pipeline to minimize idle time. Can span across nodes.

Fully Sharded Data Parallelism (FSDP/ZeRO) — Shards model parameters, gradients, and optimizer states across all GPUs. Parameters are gathered on-demand for computation, then resharded. (Rajbhandari et al., 2020)

Llama 3 405B was trained with 8-way tensor parallelism within each node, 16-way pipeline parallelism across nodes, and data parallelism across node groups (Grattafiori et al., 2024).

Key Hyperparameters

Architecture Parameters

| Parameter | Description | 7-8B | 70B | 405B |
|---|---|---|---|---|
| Layers (L) | Transformer blocks | 32 | 80 | 126 |
| Hidden dim (d_model) | Residual stream width | 4096 | 8192 | 16384 |
| Attention heads (H) | Query heads | 32 | 64 | 128 |
| KV heads | Key-value heads (GQA) | 8 | 8 | 8 |
| Head dim (d_head) | d_model / H | 128 | 128 | 128 |
| FFN dim | Inner FFN dimension | 14336 | 28672 | 53248 |
| Vocab size | Token vocabulary | 128K | 128K | 128K |
| Context length | Max sequence | 8K | 8K | 8K |

Values from Llama 3 (Grattafiori et al., 2024)

Key Relationships

  • d_head = d_model / num_heads — head dimension is derived, not independent
  • FFN dim ≈ 8/3 × d_model — the parameter-parity choice for SwiGLU (vs 4x for vanilla ReLU); production configs often round this up
  • Total params ≈ 12 × L × d_model² — rough approximation for decoder-only models
  • Training tokens ≈ 20 × params — Chinchilla compute-optimal (though modern models exceed this)
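
The parameter approximation from the list above can be applied to a Llama 3 8B-shaped model. This is deliberately rough — real counts differ with GQA, exact FFN sizing, and whether embeddings are tied:

```python
def approx_params(layers, d_model, vocab=128_000):
    """Rough decoder-only count: 12 * L * d_model^2 for the blocks, plus embeddings."""
    blocks = 12 * layers * d_model ** 2
    embed = vocab * d_model          # embedding table (assumed tied with the output head)
    return blocks + embed

# Llama 3 8B shape: 32 layers, d_model = 4096
n = approx_params(32, 4096)
print(f"{n/1e9:.1f}B")  # → 7.0B — the right ballpark for an ~8B model
```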

Training Parameters

| Parameter | Typical Value | Purpose |
|---|---|---|
| Peak learning rate | 1e-4 to 3e-4 | Step size for weight updates |
| Warmup | 2000-4000 steps | Stabilize optimizer statistics |
| LR schedule | Cosine decay | Gradual reduction to ~10% of peak |
| Batch size | Millions of tokens | Ramps up during training |
| Weight decay | 0.1 | L2 regularization via AdamW |
| Dropout | 0.0 | Not needed for large-scale pretraining |
| Gradient clipping | 1.0 | Prevent training instability |

Why zero dropout? Large models trained on massive datasets are in an under-fitting regime — there's so much data that overfitting isn't the primary concern. Dropout would reduce effective capacity. Weight decay alone provides sufficient regularization (Grattafiori et al., 2024).

Inference and Decoding

Autoregressive Generation

LLMs generate text one token at a time:

  1. Prefill phase — process the entire prompt through the model (compute-bound)
  2. Sample/select the next token from the output probability distribution
  3. Append the token to the sequence
  4. Decode phase — process the new token (memory-bandwidth-bound, uses KV cache)
  5. Repeat steps 2-4 until a stop condition (EOS token, max length)

Decoding Strategies

Greedy decoding — always pick the highest-probability token. Deterministic and fast but often repetitive and boring. Locally optimal choices don't guarantee globally optimal sequences.

Beam search — maintain k candidate sequences. Better for tasks with a "right answer" (translation, summarization) but still deterministic and tends toward generic outputs. Rarely used for open-ended generation.

Temperature scaling adjusts the "sharpness" of the probability distribution:

P(t_i) = exp(logit_i / T) / Σ exp(logit_j / T)

| Temperature | Effect |
|---|---|
| T → 0 | Equivalent to greedy (argmax) |
| T < 1 | Sharper, more deterministic |
| T = 1 | Original distribution |
| T > 1 | Flatter, more random/creative |

Top-p (nucleus) sampling (Holtzman et al., 2020) — sort tokens by probability, accumulate until the sum exceeds p, truncate, renormalize, and sample. This dynamically adjusts the candidate set: confident predictions yield a small set, uncertain predictions yield a larger one. Superior to fixed top-k because it adapts to the model's confidence.

Min-p sampling — scales the threshold proportional to the model's maximum probability: threshold = min_p × max_probability. Preserves coherence even at high temperatures.
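
Temperature and top-p compose exactly as described: scale the logits, sort, truncate at cumulative mass p, renormalize, sample. A minimal sketch:

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Temperature scaling followed by nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))   # stable softmax
    probs /= probs.sum()
    # Sort descending; keep the smallest prefix whose cumulative mass exceeds top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # number of tokens kept in the nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()    # renormalize over the nucleus
    return rng.choice(keep, p=kept)

logits = np.array([4.0, 3.0, 1.0, -2.0, -5.0])
token = sample_top_p(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng(0))
print(token)  # one of the two high-probability token ids
```

With these logits the nucleus holds only tokens 0 and 1 — a confident prediction yields a small candidate set, as described above.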

Common Parameter Combinations

| Use Case | Temperature | Top-p |
|---|---|---|
| Code generation | 0.0–0.2 | |
| Factual Q&A | 0.0–0.3 | 0.9 |
| Creative writing | 0.7–1.0 | 0.95 |
| Brainstorming | 1.0–1.2 | 0.95 |

The KV Cache

Why It Exists

During autoregressive generation, self-attention at each step needs Keys and Values from all previous tokens. Without caching, generating n tokens requires O(n²) total token computations — reprocessing the entire growing sequence at every step.

The KV cache stores previously computed K and V vectors:

Step 1: Process [1,2,3,4,5] → Store K₁..₅, V₁..₅
Step 2: Process [6] only   → Compute K₆, V₆, attend using cached K₁..₆, V₁..₆
Step 3: Process [7] only   → Compute K₇, V₇, attend using cached K₁..₇, V₁..₇

This reduces generation from O(n²) to O(n) token computations after the initial prefill.

Memory Cost

KV cache memory per token per layer:

memory = 2 × num_kv_heads × d_head × bytes_per_element

For Llama 3 8B in BF16: 32 layers × 8 KV heads × 128 d_head × 2 bytes × 2 (K+V) = 128 KB per token. At 8K context, that's 1 GB. At 128K context, 16 GB.

For Llama 3 70B: 80 layers × 8 KV heads × 128 d_head × 2 bytes × 2 = 320 KB per token. At 8K context, 2.5 GB.

The KV cache can consume up to 70% of total GPU memory during inference and grows linearly with sequence length and batch size.
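
The figures above can be reproduced from the per-token formula:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    """KV cache size: K and V vectors per layer, per KV head, per position."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # the 2 is K + V
    return per_token, per_token * seq_len

# Llama 3 8B in BF16 (2 bytes per element)
per_tok, total = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
print(per_tok // 1024, "KB per token")      # → 128 KB per token
print(total / 2**30, "GiB at 8K context")   # → 1.0 GiB at 8K context
```

Multiply `total` by the batch size for serving: 16 concurrent 8K-context requests already need 16 GiB of cache.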

Two Phases of Inference

  1. Prefill — processes the entire prompt at once. Compute-bound (matrix multiplications dominate).
  2. Decode — generates tokens one by one. Memory-bandwidth-bound (reading the KV cache from HBM dominates). The GPU spends most time reading gigabytes of cached data to do very little math.

This is why inference latency is dominated by the decode phase, and why memory bandwidth (not FLOPS) is the primary bottleneck for LLM serving.

KV Cache Compression

Several techniques reduce KV cache footprint:

  • Quantization — compress KV entries from BF16 to INT8 or INT4. KVQuant achieves up to 10M token context via aggressive KV quantization (Hooper et al., 2024)
  • Token eviction — keep only the most-attended tokens. H2O (Heavy Hitter Oracle) identifies and retains high-attention tokens (Zhang et al., 2023)
  • Attention sinks — always retain the first few tokens, which models disproportionately attend to regardless of content (Xiao et al., 2024)

Inference Acceleration

Quantization

Reduces numerical precision of model weights to decrease memory footprint and increase throughput.

| Method | Target | Quality | Speed | Use Case |
|---|---|---|---|---|
| GPTQ | NVIDIA GPU | Good (~90%) | Fastest (w/ Marlin) | Production GPU serving |
| AWQ | NVIDIA GPU | Best (~95%) | Fast (w/ Marlin) | Quality-sensitive GPU serving |
| GGUF | CPU/Apple/GPU | Good (~92%) | Moderate | Local inference, edge devices |
| FP8 | H100/H200 | Excellent | Very fast | Native H100+ support |

GPTQ (Frantar et al., 2023) — uses second-order (Hessian) information to determine which weights are most sensitive to quantization. Minimizes output error layer by layer using a calibration dataset.

AWQ (Lin et al., 2024) — identifies the ~1% of weights that are disproportionately important based on activation patterns and protects them during quantization. Near-FP16 quality at INT4.

GGUF / llama.cpp — CPU-friendly quantization format supporting 1.5-bit to 8-bit precision. Q4_K_M is the sweet spot: ~70% smaller files, 90-95% quality retention. Runs on Apple Silicon (Metal), x86 (AVX2/AVX512), NVIDIA (CUDA), and AMD (HIP). See our local LLM inference guide for detailed comparisons.

Flash Attention

An IO-aware, exact attention algorithm that eliminates the memory bottleneck of standard attention (Dao et al., 2022).

The problem: standard attention materializes the full N×N attention matrix in HBM (slow global GPU memory). Each element is written then read back multiple times.

Flash Attention fixes this by:

  1. Tiling — splits Q, K, V into blocks that fit in SRAM (fast on-chip memory)
  2. Incremental softmax — computes softmax across blocks using running statistics (online softmax trick)
  3. No materialization — never writes the full N×N matrix to HBM
  4. Recomputation — in the backward pass, recomputes attention on-chip instead of reading it

The result: up to 7.6x faster than standard attention, with linear memory in sequence length (vs quadratic). Crucially, Flash Attention is exact — it computes mathematically identical results to standard attention. It's purely an IO optimization.

Flash Attention 2 (Dao, 2023) improved work partitioning across GPU threads, reducing synchronization overhead for another ~2x speedup.
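
The online softmax trick at the heart of Flash Attention can be verified in isolation: process a vector block by block, maintaining only a running max and a rescaled running sum, and the result matches the full softmax exactly.

```python
import numpy as np

def online_softmax(x, block=4):
    """Block-wise (online) softmax: one pass with running max m and running sum s."""
    m, s = -np.inf, 0.0
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        m_new = max(m, blk.max())
        # Rescale the old sum to the new max, then add this block's contribution.
        s = s * np.exp(m - m_new) + np.exp(blk - m_new).sum()
        m = m_new
    return np.exp(x - m) / s

x = np.random.default_rng(0).normal(size=10)
full = np.exp(x - x.max()); full /= full.sum()
print(np.allclose(online_softmax(x), full))  # → True: exact, not approximate
```

Flash Attention applies the same rescaling to running weighted sums of V blocks, so the full N×N score matrix never needs to exist at once.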

Speculative Decoding

Uses a small "draft" model to predict multiple tokens, verified in parallel by the large "target" model (Leviathan et al., 2023):

  1. Draft model generates K candidates quickly (K = 5-8)
  2. Target model verifies all K in a single forward pass
  3. Accepted tokens match what the target would have generated
  4. On first mismatch, reject that token and all subsequent
  5. Target generates the correct token for the rejected position

Lossless — the verification step guarantees the output is statistically identical to the target model alone. Typical speedup: 2-3x. EAGLE achieves ~80% acceptance rate with up to 3.6x speedup (Li et al., 2024).
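
The accept/reject loop is easy to sketch for the greedy case (the general stochastic case uses rejection sampling between the two distributions). The `draft_next`/`target_next` argmax functions below are hypothetical stand-ins for real models:

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One greedy speculative step: draft proposes k tokens, target verifies them.

    draft_next / target_next map a context tuple to that model's argmax token.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(tuple(ctx))
        proposed.append(t)
        ctx.append(t)
    # 2. Target checks every position (one parallel forward pass in practice).
    accepted, ctx = [], list(context)
    for t in proposed:
        correct = target_next(tuple(ctx))
        if t == correct:
            accepted.append(t)              # match: keep the draft token
            ctx.append(t)
        else:
            accepted.append(correct)        # target's token replaces the first mismatch
            break
    else:
        accepted.append(target_next(tuple(ctx)))  # bonus token when all k match
    return accepted

# Toy models: draft agrees with target except after a context ending in 2.
target = lambda ctx: (ctx[-1] + 1) % 5
draft = lambda ctx: 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 5
out = speculative_step(draft, target, (0,), k=4)
print(out)  # → [1, 2, 3]: two draft tokens accepted, target fixes the third
```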

Continuous Batching and PagedAttention

Continuous batching (Yu et al., 2022, Orca) changes the batch composition at every decode step. When a request finishes, it's immediately replaced by a new one. This eliminates idle GPU cycles from static batching, achieving up to 36.9x higher throughput.

PagedAttention (Kwon et al., 2023, vLLM) manages KV cache memory like an OS manages virtual memory — fixed-size pages allocated dynamically, non-contiguous in physical memory. This eliminates KV cache fragmentation and enables memory sharing between requests with common prefixes (e.g., shared system prompts).

Both techniques are now standard in production serving frameworks like vLLM, TensorRT-LLM, and TGI.

Fine-Tuning and Alignment

The Three Stages

A raw pretrained transformer is a powerful text predictor but a poor assistant. Post-training transforms it into something useful:

Stage 1: Supervised Fine-Tuning (SFT) — train on curated (instruction, response) pairs. The model learns formatting, instruction following, and conversational patterns. Key work: FLAN (Wei et al., 2022), InstructGPT (Ouyang et al., 2022).

Stage 2: Preference Optimization — align the model with human preferences using either RLHF or DPO.

Stage 3: Safety Training — additional alignment for harmlessness. Anthropic's Constitutional AI (Bai et al., 2022) uses AI-generated feedback (RLAIF) based on predefined principles rather than relying entirely on human annotators.

RLHF vs DPO

RLHF (Reinforcement Learning from Human Feedback) is a three-phase process:

  1. Collect human preference data (rank multiple responses per prompt)
  2. Train a reward model to predict preferences
  3. Use PPO to optimize the LLM against the reward model, with a KL penalty to prevent divergence

DPO (Direct Preference Optimization) (Rafailov et al., 2023) eliminates the reward model entirely by directly optimizing the LLM using preference pairs in a single training stage. The key insight: the optimal RLHF policy can be derived in closed form as a function of the preference data.

DPO advantages: simpler pipeline (one stage vs three), more stable training (no PPO instability), faster convergence, lower compute cost, and empirically matches or exceeds RLHF quality. DPO has become the preferred alignment method as of 2025.

LoRA and QLoRA

LoRA (Low-Rank Adaptation) (Hu et al., 2021) exploits the fact that weight updates during fine-tuning occupy a low-rank subspace:

W' = W + ΔW = W + BA
where B is (d × r), A is (r × d), r << d

For d=4096, r=16: 131K trainable params vs 16.7M for full fine-tuning — a 128x reduction. B is initialized to zero so the model starts identical to the pretrained version. At inference, the adapter merges into the base weights with zero latency overhead.
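
Both the parameter count and the zero-init identity property check out numerically. A minimal sketch of the adapted forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 16

W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init

# Trainable params: 2 * d * r for the adapter, vs d * d for full fine-tuning.
print(2 * d * r)                      # → 131072 (~131K)
print(d * d // (2 * d * r))           # → 128 (the 128x reduction)

x = rng.normal(size=d)
# Because B = 0, the adapted model starts exactly equal to the pretrained one.
print(np.allclose(W @ x + B @ (A @ x), W @ x))  # → True
```

After training, computing `W + B @ A` once merges the adapter, which is why inference carries no latency overhead.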

QLoRA (Dettmers et al., 2023) combines 4-bit NormalFloat quantization of base weights with BF16 LoRA adapters. This enables fine-tuning a 65B parameter model on a single 48GB GPU.

Catastrophic Forgetting

Fine-tuning on new data can destroy pretrained capabilities. Mitigation strategies:

  • Mix general data with task-specific data during fine-tuning
  • Use lower learning rates (1e-5 to 5e-5 vs 1e-4 for pretraining)
  • LoRA (freeze base weights, though not fully immune)
  • Elastic Weight Consolidation — penalize changes to important weights (Kirkpatrick et al., 2017)

Scaling Laws

Chinchilla: The Compute-Optimal Ratio

Hoffmann et al. (2022) trained 400+ models and discovered the compute-optimal relationship:

D_optimal ≈ 20 × N

For every doubling of model size, training tokens should also double. This proved GPT-3 (175B params, 300B tokens) was severely undertrained — Chinchilla-optimal would have been 3.5T tokens. Chinchilla (70B params, 1.4T tokens) outperformed the 4x larger Gopher by being compute-optimally trained.

Beyond Chinchilla: Inference-Optimal Training

Modern practice exceeds the Chinchilla ratio because inference cost matters more than training cost:

| Model | Params | Training Tokens | Tokens/Param |
|---|---|---|---|
| Chinchilla | 70B | 1.4T | 20x |
| Llama 2 | 70B | 2T | 29x |
| Llama 3 8B | 8B | 15T | 1,875x |
| Llama 3 70B | 70B | 15T | 214x |

A smaller model trained on more data can match a larger model's quality while being cheaper to serve. Training happens once; inference happens millions of times. This is sometimes called "inference-optimal" training — spend more on training to get a smaller, cheaper-to-deploy model.

Emergent Abilities

Certain capabilities appear to "emerge" suddenly at scale rather than improving gradually — few-shot arithmetic around 10B+ parameters, chain-of-thought reasoning around 100B+ (Wei et al., 2022).

Controversy: Schaeffer et al. (2023) argue emergence is an artifact of evaluation metrics. When measured with continuous metrics (token-level accuracy) instead of discontinuous ones (exact match), improvements are typically smooth. The practical takeaway: scaling reliably improves capabilities, but whether a specific benchmark shows a "jump" depends on the evaluation methodology.

Modern Innovations: The 2025-2026 LLM Stack

Nearly every frontier model uses this combination. These aren't optional upgrades — they're the standard architecture.

RoPE (Rotary Positional Embeddings)

Su et al. (2022) — used by Llama, Mistral, Qwen, Gemma, and nearly all modern LLMs.

Self-attention is permutation-invariant — it has no concept of token order. RoPE injects positional information by applying position-dependent rotations to Q and K vectors before computing attention. For position m, each pair of features is rotated by angle m × θ:

θ_i = base^(-2i/d)    where base = 10000

The key mathematical property: the dot product of rotated Q at position m and K at position n depends only on the relative distance (m-n), naturally encoding relative position.

Advantages over sinusoidal/learned positional encodings:

  • Relative position is directly in the attention scores
  • No additional learnable parameters
  • Better length generalization with extensions like YaRN (Peng et al., 2023)
  • Compatible with Flash Attention
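
The relative-position property can be checked numerically. A minimal sketch that rotates feature pairs one at a time (real implementations vectorize this and fuse it into the attention kernel):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by pos * theta_i."""
    d = len(x)
    out = x.astype(float).copy()
    for i in range(0, d, 2):
        theta = base ** (-i / d)   # theta for pair i/2, i.e. base^(-2(i/2)/d)
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        out[i] = c * x[i] - s * x[i + 1]
        out[i + 1] = s * x[i] + c * x[i + 1]
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The dot product depends only on the relative offset m - n:
a = rope(q, pos=7) @ rope(k, pos=3)      # offset 4
b = rope(q, pos=104) @ rope(k, pos=100)  # offset 4, different absolute positions
print(np.isclose(a, b))  # → True
```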

GQA (Grouped Query Attention)

Ainslie et al. (2023) — used by Llama 3, Mistral, Gemma, Qwen.

Standard multi-head attention (MHA) gives every head its own K and V — expensive to store. Multi-Query Attention (MQA) uses a single K,V for all heads — too aggressive, degrades quality.

GQA is the middle ground: multiple query heads share the same K,V heads.

| Type | Q Heads | KV Heads | KV Cache Size |
|---|---|---|---|
| MHA | H | H | 1x |
| GQA | H | G (G < H) | G/H of MHA |
| MQA | H | 1 | 1/H of MHA |

Llama 3 uses 32 query heads and 8 KV heads — every 4 query heads share 1 KV head pair. This gives a 4x reduction in KV cache with minimal quality loss compared to full MHA.
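
The head-sharing arrangement can be sketched with Llama 3 8B's shape (32 query heads, 8 KV heads); only the G KV heads ever need to be cached:

```python
import numpy as np

H, G, d_head, n = 32, 8, 128, 16   # query heads, KV heads, head dim, seq length
group = H // G                      # 4 query heads share each KV head

rng = np.random.default_rng(0)
Q = rng.normal(size=(H, n, d_head))
K = rng.normal(size=(G, n, d_head))  # only G KV heads are stored in the cache

# Query head h attends against shared KV head h // group.
scores = np.stack([Q[h] @ K[h // group].T for h in range(H)])
print(scores.shape)  # → (32, 16, 16): full H heads of attention from only G KV heads
```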

SwiGLU Activation

Shazeer (2020) — used by Llama, Mistral, PaLM, Gemma.

Replaces the standard ReLU FFN with a gated activation:

Standard: FFN(x) = ReLU(xW₁) × W₂           (2 weight matrices)
SwiGLU:   FFN(x) = (Swish(xW₁) ⊙ xW₂) × W₃  (3 weight matrices)

The element-wise product (gating) lets the network selectively pass information — one branch determines what passes through. This is strictly more expressive than ReLU. To maintain parameter parity with the extra matrix, the inner dimension is reduced from 4×d_model to 8/3×d_model.

Empirically, SwiGLU produces better perplexity per FLOP than ReLU or GELU.
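
A minimal NumPy sketch of the gated FFN, with toy dimensions (real models wrap this in RMSNorm and a residual connection):

```python
import numpy as np

def swish(x):
    return x / (1 + np.exp(-x))   # x * sigmoid(x), a.k.a. SiLU

def swiglu_ffn(x, W1, W2, W3):
    """Swish branch gates the linear branch element-wise, then project back down."""
    return (swish(x @ W1) * (x @ W2)) @ W3

d_model = 12
d_ff = int(8 / 3 * d_model)       # reduced inner dim keeps parameter parity
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff))   # gate branch
W2 = rng.normal(size=(d_model, d_ff))   # value branch
W3 = rng.normal(size=(d_ff, d_model))   # down-projection

y = swiglu_ffn(rng.normal(size=d_model), W1, W2, W3)
print(y.shape)  # → (12,): back to d_model
```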

RMSNorm

Zhang & Sennrich (2019) — used by Llama, Mistral, Gemma, Qwen.

Simplifies Layer Normalization by dropping mean centering:

LayerNorm(x) = γ × (x - mean(x)) / sqrt(var(x) + ε) + β
RMSNorm(x)   = γ × x / sqrt(mean(x²) + ε)

~15% faster, fewer parameters (no β), and empirically matches LayerNorm quality for transformers.
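
Both formulas are one-liners; a sketch of RMSNorm with a quick sanity check:

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """Scale by the root-mean-square; no mean subtraction, no beta."""
    return gamma * x / np.sqrt(np.mean(x ** 2) + eps)

x = np.array([3.0, -4.0])         # RMS = sqrt((9 + 16) / 2) = sqrt(12.5)
y = rmsnorm(x, gamma=np.ones(2))
print(np.isclose(np.mean(y ** 2), 1.0))  # → True: unit mean-square after normalization
```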

Mixture of Experts (MoE)

Used by: Mixtral (Jiang et al., 2024), DeepSeek, Llama 4 Scout

Replace the single FFN per block with multiple "expert" FFNs and a router:

Standard: x → FFN(x) → output
MoE:      x → Router(x) → select top-k experts → Σ(expert_i(x)) → output

Mixtral 8x7B has 46.7B total parameters but only 12.9B active per token (top-2 routing). It matches Llama 2 70B quality with 3-4x less compute per token.

The tradeoff: MoE models require more total memory (all experts loaded) but use less compute per token, making them faster and cheaper at inference time.
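
A toy top-k router makes the routing step concrete. The linear "experts" below are stand-ins for full FFNs, and the load-balancing losses real MoE training needs are omitted:

```python
import numpy as np

def moe_layer(x, experts, router_W, k=2):
    """Top-k routing: score experts, keep the best k, return their weighted sum."""
    logits = router_W @ x
    topk = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()                              # softmax over the selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
Ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in Ws]  # tiny linear stand-ins for expert FFNs
router_W = rng.normal(size=(n_experts, d))

y = moe_layer(rng.normal(size=d), experts, router_W, k=2)
print(y.shape)  # → (8,): only 2 of the 4 experts actually ran
```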

Putting It All Together

Here's the complete pipeline — from text to prediction — in a modern LLM:

  1. Tokenize — BPE/SentencePiece splits text into subword tokens mapped to integer IDs
  2. Embed — look up each token ID in the embedding matrix (vocab_size × d_model)
  3. Apply RoPE — encode position information via rotation of Q, K vectors
  4. Process through N decoder blocks, each containing:
    • RMSNorm → GQA attention (with causal mask and KV cache) → residual
    • RMSNorm → SwiGLU FFN → residual
  5. Final RMSNorm → linear projection to vocabulary → softmax
  6. Decode — sample next token via temperature + top-p, append to KV cache, repeat

At training time, the full sequence is processed in parallel with causal masking. At inference time, the prefill phase processes the prompt in one pass, then the decode phase generates tokens one at a time using the KV cache.

Quick Reference: Key Numbers

| Metric | Value |
|---|---|
| Transformer paper | Vaswani et al., 2017 |
| Chinchilla optimal ratio | 20 tokens per parameter |
| Llama 3 8B training data | 15T tokens (1,875x Chinchilla) |
| Flash Attention speedup | Up to 7.6x over standard |
| LoRA parameter reduction | ~100x fewer trainable params |
| QLoRA memory savings | Fine-tune 65B on single 48GB GPU |
| Speculative decoding speedup | 2-3x typical, up to 3.6x |
| Continuous batching improvement | Up to 36.9x throughput |
| KV cache share of GPU memory | Up to 70% during inference |
| GQA KV cache reduction (Llama 3) | 4x (32Q/8KV) |
| Attention complexity | O(n²d) time, O(n²) space |
| RMSNorm speedup over LayerNorm | ~15% |
| Mixtral active vs total params | 12.9B / 46.7B |

Sources

Research Papers

  • Attention Is All You Need — Vaswani et al. (2017)
  • The Llama 3 Herd of Models — Grattafiori et al. (2024)
  • Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al. (2022)
  • FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al. (2022)
  • FlashAttention-2: Faster Attention with Better Parallelism — Dao (2023)
  • Direct Preference Optimization (DPO) — Rafailov et al. (2023)
  • LoRA: Low-Rank Adaptation of Large Language Models — Hu et al. (2021)
  • QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. (2023)
  • Training Language Models to Follow Instructions with Human Feedback (InstructGPT/RLHF) — Ouyang et al. (2022)
  • Constitutional AI: Harmlessness from AI Feedback — Bai et al. (2022)
  • RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) — Su et al. (2022)
  • GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al. (2023)
  • GLU Variants Improve Transformer (SwiGLU) — Shazeer (2020)
  • Root Mean Square Layer Normalization (RMSNorm) — Zhang & Sennrich (2019)
  • Mixtral of Experts — Jiang et al. (2024)
  • Efficient Memory Management for LLM Serving with PagedAttention (vLLM) — Kwon et al. (2023)
  • Orca: A Distributed Serving System for Transformer-Based Models — Yu et al. (2022)
  • Fast Inference from Transformers via Speculative Decoding — Leviathan et al. (2023)
  • EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al. (2024)
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers — Frantar et al. (2023)
  • AWQ: Activation-Aware Weight Quantization — Lin et al. (2024)
  • KVQuant: Towards 10M Context Length via KV Cache Quantization — Hooper et al. (2024)
  • H2O: Heavy-Hitter Oracle for Efficient KV Cache — Zhang et al. (2023)
  • Efficient Streaming Language Models with Attention Sinks — Xiao et al. (2024)
  • The Curious Case of Neural Text Degeneration (Nucleus Sampling) — Holtzman et al. (2020)
  • Neural Machine Translation of Rare Words with Subword Units (BPE) — Sennrich et al. (2016)
  • SentencePiece: A Simple and Language Independent Tokenizer — Kudo & Richardson (2018)
  • Deep Residual Learning for Image Recognition — He et al. (2015)
  • On Layer Normalization in the Transformer Architecture (Pre-Norm) — Xiong et al. (2020)
  • Decoupled Weight Decay Regularization (AdamW) — Loshchilov & Hutter (2019)
  • Mixed Precision Training — Micikevicius et al. (2018)
  • Megatron-LM: Training Multi-Billion Parameter Language Models (Tensor Parallelism) — Shoeybi et al. (2020)
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Rajbhandari et al. (2020)
  • Emergent Abilities of Large Language Models — Wei et al. (2022)
  • Are Emergent Abilities of Large Language Models a Mirage? — Schaeffer et al. (2023)
  • YaRN: Efficient Context Window Extension of Large Language Models — Peng et al. (2023)
  • Finetuned Language Models Are Zero-Shot Learners (FLAN) — Wei et al. (2022)
  • Overcoming Catastrophic Forgetting (EWC) — Kirkpatrick et al. (2017)
  • Scaling Laws for Vocabulary Size in LLMs — Tao et al. (2024)
