How Large Language Models Work: The Complete Technical Guide to Transformers, Training, and Inference (2026)
By Dylan Boudro (https://x.com/StarmorphAI)
25 min read
TL;DR: This is a complete technical walkthrough of how large language models work — from the transformer architecture and self-attention to training dynamics, inference optimization, and the modern innovations that power GPT-4o, Claude, Llama 3, and every other frontier model. Every claim is backed by research papers. Whether you're preparing for ML interviews, building LLM-powered applications, or just want to deeply understand the technology reshaping software, this guide covers the full stack.
This guide synthesizes content from two deep-dive references I created for my own learning: a technical interview prep document and an interactive visual guide. I'm combining them here into a single authoritative resource with thorough sourcing.
Table of Contents
- The Transformer Architecture
- Tokenization
- The Attention Mechanism
- Training: How LLMs Learn
- Key Hyperparameters
- Inference and Decoding
- The KV Cache
- Inference Acceleration
- Fine-Tuning and Alignment
- Scaling Laws
- Modern Innovations: The 2025-2026 LLM Stack
- Putting It All Together
- Sources
The Transformer Architecture
Before 2017, NLP relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. These processed text sequentially — one token at a time, left to right — which meant they were slow (no parallelization) and struggled with long-range dependencies. By the time an RNN reached the end of a long paragraph, it had largely "forgotten" the beginning.
The transformer (Vaswani et al., 2017 — "Attention Is All You Need") solved both problems. Instead of processing tokens sequentially, transformers process the entire input in parallel using self-attention. Every token can directly attend to every other token, regardless of distance.
This single change unlocked massive parallelization on GPUs, direct long-range context modeling, transfer learning at scale, and predictable performance scaling with more data and compute.
Decoder-Only: The Modern LLM
The original transformer had an encoder-decoder structure for machine translation. Modern LLMs — GPT-4o, Claude, Llama, Gemini, Mistral — all use only the decoder half with causal masking. No encoder, no cross-attention.
Each decoder block contains:
- RMSNorm (pre-normalization) — stabilizes activations before each sub-layer
- Masked Multi-Head Attention with GQA and RoPE — contextual mixing between tokens
- Residual connection — skip connection around the attention sub-layer
- RMSNorm — pre-normalization before the FFN
- Feed-Forward Network with SwiGLU — position-wise transformation that stores factual knowledge
- Residual connection — skip connection around the FFN
A modern LLM stacks 32-126 of these blocks. Llama 3 8B has 32 layers, Llama 3 70B has 80, and Llama 3 405B has 126 (Grattafiori et al., 2024).
Residual Connections
Skip connections around every sub-layer solve three problems:
- Gradient flow — gradients can flow directly backward through the network, enabling training of 100+ layer models without vanishing gradients (He et al., 2015)
- Residual learning — each layer learns a delta rather than a complete new representation
- Identity path — earlier layer outputs can pass through unchanged if a later layer's weights are near-zero
The formula in a modern pre-norm transformer:
output = x + SubLayer(Norm(x))
Pre-norm (normalizing before the sub-layer) produces more stable training dynamics than post-norm, especially for deep models. Nearly all modern LLMs use pre-norm (Xiong et al., 2020).
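The pre-norm residual pattern is easy to verify numerically. A minimal NumPy sketch, with `sub_layer` standing in for attention or the FFN; the final assert checks the identity-path property described above:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize by the root mean square over the feature dimension
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sub_layer):
    # output = x + SubLayer(Norm(x)): the residual path carries x unchanged
    return x + sub_layer(rms_norm(x))

x = np.random.default_rng(0).normal(size=(4, 8))   # (seq_len, d_model)
zero_layer = lambda h: np.zeros_like(h)            # a sub-layer with zeroed weights
assert np.allclose(pre_norm_block(x, zero_layer), x)   # identity path preserved
```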
Tokenization
LLMs don't process text directly — they operate on sequences of integer token IDs from a fixed vocabulary. All modern LLMs use subword tokenization, which balances three competing concerns:
| Level | Vocabulary Size | Sequence Length | Handles Unknown Words? |
|---|---|---|---|
| Character | ~256 | Very long | Yes |
| Subword | 32K–128K | Moderate | Yes |
| Word | 100K+ | Short | No |
Byte Pair Encoding (BPE)
BPE (Sennrich et al., 2016) is the most widely used tokenization algorithm. It starts with individual characters and iteratively merges the most frequent adjacent pair into a new token until the desired vocabulary size is reached.
Used by: GPT-2, GPT-3, GPT-4, Llama (via SentencePiece)
The key property is determinism — BPE saves the merge rules, enabling consistent encoding of new text at inference time.
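The merge loop can be illustrated with a toy example. This is a sketch over a single repeated string, not a corpus-level tokenizer with word frequencies:

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE: start from characters and greedily merge the most frequent
    adjacent symbol pair. Returns the final symbols and the learned merge
    rules (which real BPE saves for deterministic re-encoding)."""
    symbols = list(word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged symbol
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, merges

tokens, merges = bpe_merges("lowlowlowest", 2)
assert tokens == ['low', 'low', 'low', 'e', 's', 't']
assert merges == [('l', 'o'), ('lo', 'w')]
```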
SentencePiece
SentencePiece (Kudo & Richardson, 2018) treats input as raw bytes with no language-specific pre-tokenization. It can use either BPE or Unigram as the underlying algorithm. This makes it language-agnostic — critical for multilingual models.
Used by: Llama, T5, mBART
Vocabulary Size Matters
| Model | Vocab Size | Tokens/Param Ratio Impact |
|---|---|---|
| GPT-2 | 50,257 | Baseline |
| GPT-4 | ~100K | Better multilingual coverage |
| Llama 2 | 32K | Optimal for 7B params |
| Llama 3 | 128K | 4x increase — crucial for code and multilingual |
Llama 3's jump from 32K to 128K vocabulary was one of its most impactful changes. Larger vocabulary means fewer tokens per input, which means more content fits in the context window. The embedding table grows, but this is a small fraction of total parameters for a 70B+ model (Grattafiori et al., 2024).
Optimal vocabulary size scales with model size — research shows Llama 2's 32K vocabulary was optimal for 7B parameters, but the 70B model would have benefited from ~216K tokens (Tao et al., 2024).
The Attention Mechanism
Attention is the core innovation of the transformer. It lets every token "look at" every other token and decide which are relevant.
Scaled Dot-Product Attention
The formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V
Step by step:
- Project input X through learned weight matrices to get Q (queries), K (keys), V (values)
- Compute scores — dot product of Q and K gives raw relevance scores
- Scale — divide by sqrt(d_k) to prevent softmax saturation
- Mask — set future positions to -infinity (causal masking for autoregressive generation)
- Softmax — convert scores to a probability distribution (rows sum to 1)
- Weighted sum — multiply attention weights by V to produce context-enriched representations
Why scale by sqrt(d_k)? When d_k is large (e.g., 128), dot products grow in magnitude — their variance scales proportionally with d_k. Large values push softmax into near-one-hot distributions with vanishing gradients. Scaling keeps variance at ~1, maintaining healthy gradient flow (Vaswani et al., 2017).
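The six steps above translate almost directly into NumPy. An illustrative single-head sketch, not an optimized implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) raw relevance scores
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # mask future positions
    weights = softmax(scores)                     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))              # toy single head: n=4, d_k=8
out, weights = scaled_dot_product_attention(Q, K, V)
assert np.allclose(weights.sum(axis=-1), 1.0)     # rows are probability distributions
assert np.allclose(np.triu(weights, k=1), 0.0)    # no attention to the future
```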
Multi-Head Attention
Rather than one attention computation, the model runs h parallel attention heads, each with independently learned Q, K, V projections operating on a subspace of dimension d_head = d_model / h:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W^O
where head_i = Attention(X × W_i^Q, X × W_i^K, X × W_i^V)
Each head can learn different relationship types — one for syntactic dependencies, another for semantic similarity, another for coreference chains. The original transformer used 8 heads; modern LLMs use 32-128. The total parameter cost equals single-head attention with full dimension.
Causal Masking
In decoder-only LLMs, tokens can only attend to previous tokens (and themselves). This is enforced by a lower-triangular mask:
Mask = [[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0],
[1, 1, 1, 1]]
Positions where mask = 0 are set to -infinity before softmax, making their attention weight effectively zero. This ensures each position can only use past information, matching the autoregressive generation pattern at inference time.
Computational Complexity
| Operation | Time | Space |
|---|---|---|
| QK^T (attention scores) | O(n²d) | O(n²) |
| Softmax | O(n²) | O(n²) |
| Attention × V | O(n²d) | O(nd) |
| Total | O(n²d) | O(n²) |
The O(n²) scaling with sequence length is the fundamental bottleneck. For 8K context, the attention matrix has 64M entries per head. For 128K, it's 16B entries. This drives the need for Flash Attention and other optimizations.
Training: How LLMs Learn
Pretraining: Next-Token Prediction
LLMs are pretrained on massive text corpora using a deceptively simple objective: predict the next token given all previous tokens (causal language modeling).
L = -1/N × Σ log P(t_i | t_1, ..., t_{i-1})
This cross-entropy loss measures how "surprised" the model is by the actual next token. Lower loss = better predictions. Perplexity is the exponentiated loss: PPL = exp(L). A perplexity of 10 means the model is as uncertain as if choosing uniformly among 10 tokens.
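The loss-to-perplexity relationship can be checked in a few lines. A toy calculation, assuming we already have the probabilities the model assigned to the true tokens:

```python
import math

def cross_entropy_and_perplexity(true_token_probs):
    """Average negative log-likelihood of the actual next tokens, and its
    exponential (perplexity). Inputs are the probabilities P(t_i | t_1..t_{i-1})
    that the model assigned to each true token."""
    loss = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return loss, math.exp(loss)

# Assigning 0.1 to every true token is as uncertain as a uniform
# choice among 10 options: perplexity 10.
loss, ppl = cross_entropy_and_perplexity([0.1] * 5)
assert abs(ppl - 10.0) < 1e-9
```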
The entire sequence is processed in parallel during training, with causal masking ensuring each position only sees past tokens — matching inference behavior.
Optimization
AdamW (Loshchilov & Hutter, 2019) is the standard optimizer for LLM training. It provides adaptive per-parameter learning rates (via moving averages of gradients and squared gradients), momentum for smoothing noisy gradients, and decoupled weight decay for regularization.
Training typically uses:
- Learning rate warmup — gradual increase over 1-10% of total steps, allowing Adam's running averages to stabilize before large updates
- Cosine decay — learning rate gradually decreases to ~10% of peak after warmup
- Gradient clipping — max norm of 1.0 to prevent training instability
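A typical warmup-plus-cosine schedule can be sketched as follows. The peak LR and warmup length are illustrative defaults from the ranges above, not values from any particular run:

```python
import math

def lr_schedule(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

assert lr_schedule(1999, 100_000) == 3e-4                 # warmup ends at the peak
assert abs(lr_schedule(100_000, 100_000) - 3e-5) < 1e-12  # decays to ~10% of peak
```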
Mixed Precision Training
Modern training uses BF16 (bfloat16) for forward and backward passes while maintaining FP32 master weights (Micikevicius et al., 2018):
- Store master weights in FP32
- Cast to BF16 for forward pass
- Compute loss in FP32 (numerical stability)
- Backward pass in BF16
- Update master weights in FP32
BF16 is preferred over FP16 because it has the same exponent range as FP32 (preventing overflow). This yields ~2x memory reduction for activations and ~2x faster matrix multiplication on Tensor Cores.
Distributed Training
Training models with hundreds of billions of parameters requires distributing computation across thousands of GPUs. The main parallelism strategies:
Data Parallelism — Each GPU holds a complete model copy, processes different minibatches, and gradients are averaged via all-reduce. Simple but limited by per-GPU memory.
Tensor Parallelism (TP) — Individual weight matrices are split across GPUs. Requires high-bandwidth NVLink interconnect. Typically used within a single node (8 GPUs). (Shoeybi et al., 2020)
Pipeline Parallelism (PP) — Different layers assigned to different GPUs. Micro-batching fills the pipeline to minimize idle time. Can span across nodes.
Fully Sharded Data Parallelism (FSDP/ZeRO) — Shards model parameters, gradients, and optimizer states across all GPUs. Parameters are gathered on-demand for computation, then resharded. (Rajbhandari et al., 2020)
Llama 3 405B was trained with 8-way tensor parallelism within each node, 16-way pipeline parallelism across nodes, and data parallelism across node groups (Grattafiori et al., 2024).
Key Hyperparameters
Architecture Parameters
| Parameter | Description | 7-8B | 70B | 405B |
|---|---|---|---|---|
| Layers (L) | Transformer blocks | 32 | 80 | 126 |
| Hidden dim (d_model) | Residual stream width | 4096 | 8192 | 16384 |
| Attention heads (H) | Query heads | 32 | 64 | 128 |
| KV heads | Key-value heads (GQA) | 8 | 8 | 8 |
| Head dim (d_head) | d_model / H | 128 | 128 | 128 |
| FFN dim | Inner FFN dimension | ~11K | ~22K | ~44K |
| Vocab size | Token vocabulary | 128K | 128K | 128K |
| Context length | Max sequence | 8K | 8K | 8K |
Values from Llama 3 (Grattafiori et al., 2024)
Key Relationships
- d_head = d_model / num_heads — head dimension is derived, not independent
- FFN dim ≈ 8/3 × d_model — for SwiGLU (vs 4x for vanilla ReLU)
- Total params ≈ 12 × L × d_model² — rough approximation for decoder-only models
- Training tokens ≈ 20 × params — Chinchilla compute-optimal (though modern models exceed this)
Training Parameters
| Parameter | Typical Value | Purpose |
|---|---|---|
| Peak learning rate | 1e-4 to 3e-4 | Step size for weight updates |
| Warmup | 2000-4000 steps | Stabilize optimizer statistics |
| LR schedule | Cosine decay | Gradual reduction to ~10% of peak |
| Batch size | Millions of tokens | Ramps up during training |
| Weight decay | 0.1 | L2 regularization via AdamW |
| Dropout | 0.0 | Not needed for large-scale pretraining |
| Gradient clipping | 1.0 | Prevent training instability |
Why zero dropout? Large models trained on massive datasets are in an under-fitting regime — there's so much data that overfitting isn't the primary concern. Dropout would reduce effective capacity. Weight decay alone provides sufficient regularization (Grattafiori et al., 2024).
Inference and Decoding
Autoregressive Generation
LLMs generate text one token at a time:
- Prefill phase — process the entire prompt through the model (compute-bound)
- Sample/select the next token from the output probability distribution
- Append the token to the sequence
- Decode phase — process the new token (memory-bandwidth-bound, uses KV cache)
- Repeat steps 2-4 until a stop condition (EOS token, max length)
Decoding Strategies
Greedy decoding — always pick the highest-probability token. Deterministic and fast but often repetitive and boring. Locally optimal choices don't guarantee globally optimal sequences.
Beam search — maintain k candidate sequences. Better for tasks with a "right answer" (translation, summarization) but still deterministic and tends toward generic outputs. Rarely used for open-ended generation.
Temperature scaling adjusts the "sharpness" of the probability distribution:
P(t_i) = exp(logit_i / T) / Σ exp(logit_j / T)
| Temperature | Effect |
|---|---|
| T → 0 | Equivalent to greedy (argmax) |
| T < 1 | Sharper, more deterministic |
| T = 1 | Original distribution |
| T > 1 | Flatter, more random/creative |
Top-p (nucleus) sampling (Holtzman et al., 2020) — sort tokens by probability, accumulate until the sum exceeds p, truncate, renormalize, and sample. This dynamically adjusts the candidate set: confident predictions yield a small set, uncertain predictions yield a larger one. Superior to fixed top-k because it adapts to the model's confidence.
Min-p sampling — scales the threshold proportional to the model's maximum probability: threshold = min_p × max_probability. Preserves coherence even at high temperatures.
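Temperature and top-p compose exactly as described: scale the logits, softmax, truncate to the nucleus, renormalize, sample. A minimal sketch (assumes temperature > 0; production samplers handle more edge cases):

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) truncation and sampling.
    Assumes temperature > 0 (T -> 0 corresponds to plain argmax)."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())            # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set with mass >= top_p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()           # renormalize the nucleus
    return int(rng.choice(keep, p=kept))

logits = np.array([4.0, 2.0, 0.5, -1.0])
token = sample_top_p(logits, temperature=0.7, top_p=0.9,
                     rng=np.random.default_rng(0))
assert token in (0, 1)    # the low-probability tail is never sampled
```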
Common Parameter Combinations
| Use Case | Temperature | Top-p |
|---|---|---|
| Code generation | 0.0–0.2 | — |
| Factual Q&A | 0.0–0.3 | 0.9 |
| Creative writing | 0.7–1.0 | 0.95 |
| Brainstorming | 1.0–1.2 | 0.95 |
The KV Cache
Why It Exists
During autoregressive generation, self-attention at each step needs Keys and Values from all previous tokens. Without caching, generating n tokens requires O(n²) total token computations — reprocessing the entire growing sequence at every step.
The KV cache stores previously computed K and V vectors:
Step 1: Process [1,2,3,4,5] → Store K₁..₅, V₁..₅
Step 2: Process [6] only → Compute K₆, V₆, attend using cached K₁..₆, V₁..₆
Step 3: Process [7] only → Compute K₇, V₇, attend using cached K₁..₇, V₁..₇
This reduces generation from O(n²) to O(n) token computations after the initial prefill.
Memory Cost
KV cache memory per token per layer:
memory = 2 × num_kv_heads × d_head × bytes_per_element
For Llama 3 8B in BF16: 32 layers × 8 KV heads × 128 d_head × 2 bytes × 2 (K+V) = 128 KB per token. At 8K context, that's 1 GB. At 128K context, 16 GB.
For Llama 3 70B: 80 layers × 8 KV heads × 128 d_head × 2 bytes × 2 = 320 KB per token. At 8K context, 2.5 GB.
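Both figures follow directly from the per-token formula. A quick sanity check:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    # Factor of 2 for storing both K and V at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Llama 3 8B in BF16 (2 bytes per element)
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
assert per_token == 128 * 1024                  # 128 KB per token
assert per_token * 8192 == 1024 ** 3            # 1 GB at 8K context

# Llama 3 70B
assert kv_cache_bytes_per_token(80, 8, 128) == 320 * 1024   # 320 KB per token
```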
The KV cache can consume up to 70% of total GPU memory during inference and grows linearly with sequence length and batch size.
Two Phases of Inference
- Prefill — processes the entire prompt at once. Compute-bound (matrix multiplications dominate).
- Decode — generates tokens one by one. Memory-bandwidth-bound (reading the KV cache from HBM dominates). The GPU spends most time reading gigabytes of cached data to do very little math.
This is why inference latency is dominated by the decode phase, and why memory bandwidth (not FLOPS) is the primary bottleneck for LLM serving.
KV Cache Compression
Several techniques reduce KV cache footprint:
- Quantization — compress KV entries from BF16 to INT8 or INT4. KVQuant achieves up to 10M token context via aggressive KV quantization (Hooper et al., 2024)
- Token eviction — keep only the most-attended tokens. H2O (Heavy Hitter Oracle) identifies and retains high-attention tokens (Zhang et al., 2023)
- Attention sinks — always retain the first few tokens, which models disproportionately attend to regardless of content (Xiao et al., 2024)
Inference Acceleration
Quantization
Reduces numerical precision of model weights to decrease memory footprint and increase throughput.
| Method | Target | Quality | Speed | Use Case |
|---|---|---|---|---|
| GPTQ | NVIDIA GPU | Good (~90%) | Fastest (w/ Marlin) | Production GPU serving |
| AWQ | NVIDIA GPU | Best (~95%) | Fast (w/ Marlin) | Quality-sensitive GPU serving |
| GGUF | CPU/Apple/GPU | Good (~92%) | Moderate | Local inference, edge devices |
| FP8 | H100/H200 | Excellent | Very fast | Native H100+ support |
GPTQ (Frantar et al., 2023) — uses second-order (Hessian) information to determine which weights are most sensitive to quantization. Minimizes output error layer by layer using a calibration dataset.
AWQ (Lin et al., 2024) — identifies the ~1% of weights that are disproportionately important based on activation patterns and protects them during quantization. Near-FP16 quality at INT4.
GGUF / llama.cpp — CPU-friendly quantization format supporting 1.5-bit to 8-bit precision. Q4_K_M is the sweet spot: ~70% smaller files, 90-95% quality retention. Runs on Apple Silicon (Metal), x86 (AVX2/AVX512), NVIDIA (CUDA), and AMD (HIP). See our local LLM inference guide for detailed comparisons.
Flash Attention
An IO-aware, exact attention algorithm that eliminates the memory bottleneck of standard attention (Dao et al., 2022).
The problem: standard attention materializes the full N×N attention matrix in HBM (slow global GPU memory). Each element is written then read back multiple times.
Flash Attention fixes this by:
- Tiling — splits Q, K, V into blocks that fit in SRAM (fast on-chip memory)
- Incremental softmax — computes softmax across blocks using running statistics (online softmax trick)
- No materialization — never writes the full N×N matrix to HBM
- Recomputation — in the backward pass, recomputes attention on-chip instead of reading it
The result: up to 7.6x faster than standard attention, with linear memory in sequence length (vs quadratic). Crucially, Flash Attention is exact — it computes mathematically identical results to standard attention. It's purely an IO optimization.
Flash Attention 2 (Dao, 2023) improved work partitioning across GPU threads, reducing synchronization overhead for another ~2x speedup.
Speculative Decoding
Uses a small "draft" model to predict multiple tokens, verified in parallel by the large "target" model (Leviathan et al., 2023):
- Draft model generates K candidates quickly (K = 5-8)
- Target model verifies all K in a single forward pass
- Accepted tokens match what the target would have generated
- On first mismatch, reject that token and all subsequent
- Target generates the correct token for the rejected position
Lossless — the verification step guarantees the output is statistically identical to the target model alone. Typical speedup: 2-3x. EAGLE achieves ~80% acceptance rate with up to 3.6x speedup (Li et al., 2024).
Continuous Batching and PagedAttention
Continuous batching (Yu et al., 2022, Orca) changes the batch composition at every decode step. When a request finishes, it's immediately replaced by a new one. This eliminates idle GPU cycles from static batching, achieving up to 36.9x higher throughput.
PagedAttention (Kwon et al., 2023, vLLM) manages KV cache memory like an OS manages virtual memory — fixed-size pages allocated dynamically, non-contiguous in physical memory. This eliminates KV cache fragmentation and enables memory sharing between requests with common prefixes (e.g., shared system prompts).
Both techniques are now standard in production serving frameworks like vLLM, TensorRT-LLM, and TGI.
Fine-Tuning and Alignment
The Three Stages
A raw pretrained transformer is a powerful text predictor but a poor assistant. Post-training transforms it into something useful:
Stage 1: Supervised Fine-Tuning (SFT) — train on curated (instruction, response) pairs. The model learns formatting, instruction following, and conversational patterns. Key work: FLAN (Wei et al., 2022), InstructGPT (Ouyang et al., 2022).
Stage 2: Preference Optimization — align the model with human preferences using either RLHF or DPO.
Stage 3: Safety Training — additional alignment for harmlessness. Anthropic's Constitutional AI (Bai et al., 2022) uses AI-generated feedback (RLAIF) based on predefined principles rather than relying entirely on human annotators.
RLHF vs DPO
RLHF (Reinforcement Learning from Human Feedback) is a three-phase process:
- Collect human preference data (rank multiple responses per prompt)
- Train a reward model to predict preferences
- Use PPO to optimize the LLM against the reward model, with a KL penalty to prevent divergence
DPO (Direct Preference Optimization) (Rafailov et al., 2023) eliminates the reward model entirely by directly optimizing the LLM using preference pairs in a single training stage. The key insight: the optimal RLHF policy can be derived in closed form as a function of the preference data.
DPO advantages: simpler pipeline (one stage vs three), more stable training (no PPO instability), faster convergence, lower compute cost, and empirically matches or exceeds RLHF quality. DPO has become the preferred alignment method as of 2025.
LoRA and QLoRA
LoRA (Low-Rank Adaptation) (Hu et al., 2021) exploits the fact that weight updates during fine-tuning occupy a low-rank subspace:
W' = W + ΔW = W + BA
where B is (d × r), A is (r × d), r << d
For d=4096, r=16: 131K trainable params vs 16.7M for full fine-tuning — a 128x reduction. B is initialized to zero so the model starts identical to the pretrained version. At inference, the adapter merges into the base weights with zero latency overhead.
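The parameter arithmetic is easy to verify. This counts a single adapted (d × d) matrix; in practice LoRA is applied to several projection matrices per layer:

```python
def lora_trainable_params(d, r):
    # B is (d x r), A is (r x d): d*r + r*d trainable parameters
    return d * r + r * d

d, r = 4096, 16
full_ft = d * d                        # full fine-tuning of one (d x d) matrix
lora = lora_trainable_params(d, r)
assert lora == 131_072                 # ~131K trainable parameters
assert full_ft // lora == 128          # the 128x reduction
```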
QLoRA (Dettmers et al., 2023) combines 4-bit NormalFloat quantization of base weights with BF16 LoRA adapters. This enables fine-tuning a 65B parameter model on a single 48GB GPU.
Catastrophic Forgetting
Fine-tuning on new data can destroy pretrained capabilities. Mitigation strategies:
- Mix general data with task-specific data during fine-tuning
- Use lower learning rates (1e-5 to 5e-5 vs 1e-4 for pretraining)
- LoRA (freeze base weights, though not fully immune)
- Elastic Weight Consolidation — penalize changes to important weights (Kirkpatrick et al., 2017)
Scaling Laws
Chinchilla: The Compute-Optimal Ratio
Hoffmann et al. (2022) trained 400+ models and discovered the compute-optimal relationship:
D_optimal ≈ 20 × N
For every doubling of model size, training tokens should also double. This proved GPT-3 (175B params, 300B tokens) was severely undertrained — Chinchilla-optimal would have been 3.5T tokens. Chinchilla (70B params, 1.4T tokens) outperformed the 4x larger Gopher by being compute-optimally trained.
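The ratio makes these numbers easy to check:

```python
def chinchilla_optimal_tokens(n_params):
    # D_optimal ≈ 20 * N (Hoffmann et al., 2022)
    return 20 * n_params

# GPT-3: 175B params but only 300B training tokens
assert chinchilla_optimal_tokens(175e9) == 3.5e12   # ~3.5T tokens would be optimal
# Chinchilla: 70B params, 1.4T tokens, exactly on the ratio
assert chinchilla_optimal_tokens(70e9) == 1.4e12
```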
Beyond Chinchilla: Inference-Optimal Training
Modern practice exceeds the Chinchilla ratio because inference cost matters more than training cost:
| Model | Params | Training Tokens | Tokens/Param |
|---|---|---|---|
| Chinchilla | 70B | 1.4T | 20x |
| Llama 2 | 70B | 2T | 29x |
| Llama 3 8B | 8B | 15T | 1,875x |
| Llama 3 70B | 70B | 15T | 214x |
A smaller model trained on more data can match a larger model's quality while being cheaper to serve. Training happens once; inference happens millions of times. This is sometimes called "inference-optimal" training — spend more on training to get a smaller, cheaper-to-deploy model.
Emergent Abilities
Certain capabilities appear to "emerge" suddenly at scale rather than improving gradually — few-shot arithmetic around 10B+ parameters, chain-of-thought reasoning around 100B+ (Wei et al., 2022).
Controversy: Schaeffer et al. (2023) argue emergence is an artifact of evaluation metrics. When measured with continuous metrics (token-level accuracy) instead of discontinuous ones (exact match), improvements are typically smooth. The practical takeaway: scaling reliably improves capabilities, but whether a specific benchmark shows a "jump" depends on the evaluation methodology.
Modern Innovations: The 2025-2026 LLM Stack
Nearly every frontier model uses this combination. These aren't optional upgrades — they're the standard architecture.
RoPE (Rotary Positional Embeddings)
Su et al. (2022) — used by Llama, Mistral, Qwen, Gemma, and nearly all modern LLMs.
Self-attention is permutation-invariant — it has no concept of token order. RoPE injects positional information by applying position-dependent rotations to Q and K vectors before computing attention. For position m, each pair of features is rotated by angle m × θ:
θ_i = base^(-2i/d) where base = 10000
The key mathematical property: the dot product of rotated Q at position m and K at position n depends only on the relative distance (m-n), naturally encoding relative position.
Advantages over sinusoidal/learned positional encodings:
- Relative position is directly in the attention scores
- No additional learnable parameters
- Better length generalization with extensions like YaRN (Peng et al., 2023)
- Compatible with Flash Attention
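The relative-position property can be verified numerically. A minimal RoPE sketch using the interleaved-pair convention (one common choice; implementations differ in how they pair features):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each consecutive feature pair of x by angle pos * theta_i,
    where theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 64))
# Same relative distance (7) at different absolute positions gives the same score:
score_a = rope_rotate(q, 10) @ rope_rotate(k, 3)
score_b = rope_rotate(q, 107) @ rope_rotate(k, 100)
assert np.isclose(score_a, score_b)
```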
GQA (Grouped Query Attention)
Ainslie et al. (2023) — used by Llama 3, Mistral, Gemma, Qwen.
Standard multi-head attention (MHA) gives every head its own K and V — expensive to store. Multi-Query Attention (MQA) uses a single K,V for all heads — too aggressive, degrades quality.
GQA is the middle ground: multiple query heads share the same K,V heads.
| Type | Q Heads | KV Heads | KV Cache Size |
|---|---|---|---|
| MHA | H | H | 1x |
| GQA | H | G (G < H) | G/H of MHA |
| MQA | H | 1 | 1/H of MHA |
Llama 3 uses 32 query heads and 8 KV heads — every 4 query heads share 1 KV head pair. This gives a 4x reduction in KV cache with minimal quality loss compared to full MHA.
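In practice GQA is often implemented by broadcasting the G cached KV heads up to the H query heads just before the attention matmul. A sketch of that expansion:

```python
import numpy as np

def expand_kv_heads(kv, num_query_heads):
    """Broadcast G shared KV heads up to H query heads.
    kv: (num_kv_heads, seq, d_head) -> (num_query_heads, seq, d_head)."""
    group_size = num_query_heads // kv.shape[0]    # query heads per KV head
    return np.repeat(kv, group_size, axis=0)

# Llama 3 8B shapes: 32 query heads share 8 KV heads (groups of 4)
k = np.random.default_rng(0).normal(size=(8, 16, 128))
k_expanded = expand_kv_heads(k, 32)
assert k_expanded.shape == (32, 16, 128)
# Query heads 0-3 all read the same cached KV head 0:
assert all(np.array_equal(k_expanded[i], k[0]) for i in range(4))
```

Only the G heads are ever stored in the KV cache; the expansion happens on the fly, which is where the 4x memory saving comes from.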
SwiGLU Activation
Shazeer (2020) — used by Llama, Mistral, PaLM, Gemma.
Replaces the standard ReLU FFN with a gated activation:
Standard: FFN(x) = ReLU(xW₁) × W₂ (2 weight matrices)
SwiGLU: FFN(x) = (Swish(xW₁) ⊙ xW₂) × W₃ (3 weight matrices)
The element-wise product (gating) lets the network selectively pass information — one branch modulates what the other passes through. This makes the FFN more expressive than a plain ReLU block. To maintain parameter parity despite the extra matrix, the inner dimension is reduced from 4×d_model to 8/3×d_model.
Empirically, SwiGLU produces better perplexity per FLOP than ReLU or GELU.
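A minimal SwiGLU FFN in NumPy. Weights are random for illustration; the zero-gate check shows how the element-wise product controls information flow:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))                  # Swish / SiLU: x * sigmoid(x)

def swiglu_ffn(x, W1, W2, W3):
    """FFN(x) = (Swish(x W1) ⊙ x W2) W3: one branch gates the other elementwise."""
    return (swish(x @ W1) * (x @ W2)) @ W3

d_model, d_ff = 8, 21                              # d_ff ≈ (8/3) * d_model
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, d_model, d_ff))
W3 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(4, d_model))
assert swiglu_ffn(x, W1, W2, W3).shape == (4, d_model)
# Zeroing one branch closes the gate: nothing passes through
assert np.allclose(swiglu_ffn(x, W1, np.zeros_like(W2), W3), 0.0)
```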
RMSNorm
Zhang & Sennrich (2019) — used by Llama, Mistral, Gemma, Qwen.
Simplifies Layer Normalization by dropping mean centering:
LayerNorm(x) = γ × (x - mean(x)) / sqrt(var(x) + ε) + β
RMSNorm(x) = γ × x / sqrt(mean(x²) + ε)
~15% faster, fewer parameters (no β), and empirically matches LayerNorm quality for transformers.
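Both norms are a few lines each, and for zero-mean inputs they coincide, since the RMS then equals the standard deviation:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma=1.0, eps=1e-6):
    # No mean subtraction and no beta: just divide by the root mean square
    return gamma * x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

x = np.random.default_rng(0).normal(size=(4, 8))
centered = x - x.mean(axis=-1, keepdims=True)
# For zero-mean inputs the two normalizations agree:
assert np.allclose(rms_norm(centered), layer_norm(centered))
```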
Mixture of Experts (MoE)
Used by: Mixtral (Jiang et al., 2024), DeepSeek, Llama 4 Scout
Replace the single FFN per block with multiple "expert" FFNs and a router:
Standard: x → FFN(x) → output
MoE: x → Router(x) → select top-k experts → Σ(expert_i(x)) → output
Mixtral 8x7B has 46.7B total parameters but only 12.9B active per token (top-2 routing). It matches Llama 2 70B quality with 3-4x less compute per token.
The tradeoff: MoE models require more total memory (all experts loaded) but use less compute per token, making them faster and cheaper at inference time.
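Top-k routing for a single token can be sketched in a few lines. Experts here are plain linear maps standing in for full FFNs, and real routers also add load-balancing losses during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_experts = 8, 4

# Experts as simple linear maps (stand-ins for full SwiGLU FFNs)
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(num_experts)]
router_W = rng.normal(size=(d, num_experts))

def moe_layer(x, router_W, experts, top_k=2):
    """Score all experts, run only the top-k, and sum their outputs
    weighted by a softmax over the selected scores."""
    scores = x @ router_W                          # (num_experts,) router logits
    top = np.argsort(scores)[-top_k:]              # indices of the top-k experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                                   # softmax over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

x = rng.normal(size=d)
out = moe_layer(x, router_W, experts)              # only 2 of 4 experts execute
assert out.shape == (d,)
```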
Putting It All Together
Here's the complete pipeline — from text to prediction — in a modern LLM:
- Tokenize — BPE/SentencePiece splits text into subword tokens mapped to integer IDs
- Embed — look up each token ID in the embedding matrix (vocab_size × d_model)
- Apply RoPE — encode position information via rotation of Q, K vectors
- Process through N decoder blocks, each containing:
- RMSNorm → GQA attention (with causal mask and KV cache) → residual
- RMSNorm → SwiGLU FFN → residual
- Final RMSNorm → linear projection to vocabulary → softmax
- Decode — sample next token via temperature + top-p, append to KV cache, repeat
At training time, the full sequence is processed in parallel with causal masking. At inference time, the prefill phase processes the prompt in one pass, then the decode phase generates tokens one at a time using the KV cache.
Quick Reference: Key Numbers
| Metric | Value |
|---|---|
| Transformer paper | Vaswani et al., 2017 |
| Chinchilla optimal ratio | 20 tokens per parameter |
| Llama 3 8B training data | 15T tokens (1,875x Chinchilla) |
| Flash Attention speedup | Up to 7.6x over standard |
| LoRA parameter reduction | ~100x fewer trainable params |
| QLoRA memory savings | Fine-tune 65B on single 48GB GPU |
| Speculative decoding speedup | 2-3x typical, up to 3.6x |
| Continuous batching improvement | Up to 36.9x throughput |
| KV cache share of GPU memory | Up to 70% during inference |
| GQA KV cache reduction (Llama 3) | 4x (32Q/8KV) |
| Attention complexity | O(n²d) time, O(n²) space |
| RMSNorm speedup over LayerNorm | ~15% |
| Mixtral active vs total params | 12.9B / 46.7B |
Sources
Research Papers
- Attention Is All You Need — Vaswani et al. (2017)
- The Llama 3 Herd of Models — Grattafiori et al. (2024)
- Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al. (2022)
- FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al. (2022)
- FlashAttention-2: Faster Attention with Better Parallelism — Dao (2023)
- Direct Preference Optimization (DPO) — Rafailov et al. (2023)
- LoRA: Low-Rank Adaptation of Large Language Models — Hu et al. (2021)
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. (2023)
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT/RLHF) — Ouyang et al. (2022)
- Constitutional AI: Harmlessness from AI Feedback — Bai et al. (2022)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) — Su et al. (2022)
- GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al. (2023)
- GLU Variants Improve Transformer (SwiGLU) — Shazeer (2020)
- Root Mean Square Layer Normalization (RMSNorm) — Zhang & Sennrich (2019)
- Mixtral of Experts — Jiang et al. (2024)
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM) — Kwon et al. (2023)
- Orca: A Distributed Serving System for Transformer-Based Models — Yu et al. (2022)
- Fast Inference from Transformers via Speculative Decoding — Leviathan et al. (2023)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al. (2024)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers — Frantar et al. (2023)
- AWQ: Activation-Aware Weight Quantization — Lin et al. (2024)
- KVQuant: Towards 10M Context Length via KV Cache Quantization — Hooper et al. (2024)
- H2O: Heavy-Hitter Oracle for Efficient KV Cache — Zhang et al. (2023)
- Efficient Streaming Language Models with Attention Sinks — Xiao et al. (2024)
- The Curious Case of Neural Text Degeneration (Nucleus Sampling) — Holtzman et al. (2020)
- Neural Machine Translation of Rare Words with Subword Units (BPE) — Sennrich et al. (2016)
- SentencePiece: A Simple and Language Independent Tokenizer — Kudo & Richardson (2018)
- Deep Residual Learning for Image Recognition — He et al. (2015)
- On Layer Normalization in the Transformer Architecture (Pre-Norm) — Xiong et al. (2020)
- Decoupled Weight Decay Regularization (AdamW) — Loshchilov & Hutter (2019)
- Mixed Precision Training — Micikevicius et al. (2018)
- Megatron-LM: Training Multi-Billion Parameter Language Models (Tensor Parallelism) — Shoeybi et al. (2020)
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Rajbhandari et al. (2020)
- Emergent Abilities of Large Language Models — Wei et al. (2022)
- Are Emergent Abilities of Large Language Models a Mirage? — Schaeffer et al. (2023)
- YaRN: Efficient Context Window Extension of Large Language Models — Peng et al. (2023)
- Finetuned Language Models Are Zero-Shot Learners (FLAN) — Wei et al. (2022)
- Overcoming Catastrophic Forgetting (EWC) — Kirkpatrick et al. (2017)
- Scaling Laws for Vocabulary Size in LLMs — Tao et al. (2024)