Transformers Explained: From Attention Mechanism to GPT-4o, Claude, and Open-Source LLMs (2026)
TL;DR: Transformers are the architecture behind every major AI model you use today — GPT-4o, Claude, Gemini, Llama, and more. Introduced in 2017 with the Attention Is All You Need paper, the transformer replaced RNNs and LSTMs by processing entire sequences in parallel through self-attention. This guide walks you through how transformers work, how they evolved into today's LLMs, and which model to pick for your specific use case — from API-based services like GPT-4o and Claude to open-source models you can run locally with Ollama.
Table of Contents
- What Are Transformers?
- The Attention Mechanism Explained
- Encoder vs Decoder Architecture
- From GPT-1 to GPT-4o: The Evolution
- The Current Model Landscape
- Multimodal Transformers
- Instruction Tuning and RLHF
- Open-Source Models: Llama, Mistral, Qwen, DeepSeek
- Which Model for Which Task?
- Running Models Locally
- Fine-Tuning and Adaptation
- The Future: Mixture of Experts, State Space Models
What Are Transformers?
Before 2017, natural language processing relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. These architectures processed text sequentially — one word at a time, left to right. This created two major problems: they were slow to train (no parallelization) and they struggled with long-range dependencies. By the time an RNN reached the end of a long paragraph, it had largely "forgotten" the beginning.
The transformer architecture, introduced by Vaswani et al. in the paper Attention Is All You Need, solved both problems at once. Instead of processing tokens sequentially, transformers process the entire input sequence in parallel using a mechanism called self-attention. Every token can directly attend to every other token, regardless of distance.
This single architectural change unlocked:
- Massive parallelization — training on GPUs became orders of magnitude faster
- Long-range context — a word at position 1 can directly influence position 10,000
- Transfer learning at scale — pre-train once on massive data, fine-tune for any downstream task
- Scalability — performance improves predictably with more data and compute (scaling laws)
The result was an explosion of capability. Within two years, transformers had replaced RNNs in virtually every NLP benchmark. Within five years, they had expanded beyond text into images, audio, video, and code — becoming the universal architecture for AI.
The Attention Mechanism Explained
The core innovation of transformers is the self-attention mechanism. Here's how it works intuitively, then mechanically.
The Intuition
Consider the sentence: "The cat sat on the mat because it was tired."
What does "it" refer to? A human instantly knows it refers to "the cat," not "the mat." Self-attention gives the model a similar ability: for every token in the sequence, it calculates how much attention to pay to every other token.
The Mechanics: Query, Key, Value
Self-attention works through three learned linear transformations applied to each input token's embedding:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I provide?"
The attention score between two tokens is the dot product of one token's Query and another token's Key, scaled by the square root of the key dimension d_k. These scores are passed through a softmax to produce weights, which are then used to form a weighted sum of the Value vectors:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
This produces a new representation for each token that incorporates contextual information from the entire sequence.
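The formula above can be sketched in a few lines of TypeScript. This is a toy single-head version, with plain number arrays standing in for real tensor math:

```typescript
// Minimal sketch of scaled dot-product attention for a single head.
// Each row of q, k, v is one token's Query/Key/Value vector.

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs); // subtract the max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((acc, x, i) => acc + x * b[i], 0);
}

// Attention(Q, K, V) = softmax(QK^T / √d_k) × V
function attention(q: number[][], k: number[][], v: number[][]): number[][] {
  const dk = k[0].length;
  return q.map((qi) => {
    // scores between this token and every token in the sequence
    const scores = k.map((kj) => dot(qi, kj) / Math.sqrt(dk));
    const weights = softmax(scores);
    // weighted sum of the Value vectors
    return v[0].map((_, col) =>
      weights.reduce((acc, w, row) => acc + w * v[row][col], 0)
    );
  });
}
```

Each output row is a blend of all Value vectors, weighted by how strongly that token attends to each position.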
Multi-Head Attention
Rather than computing a single attention function, transformers use multi-head attention — multiple attention "heads" running in parallel, each learning different relationship patterns. One head might learn syntactic relationships (subject-verb agreement), another might learn semantic relationships (pronoun resolution), and another might learn positional patterns.
The outputs of all heads are concatenated and projected through a final linear layer. A typical large model uses 32–128 attention heads.
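A quick dimension check shows how the heads fit together. The specific numbers here are illustrative, not taken from any particular model:

```typescript
// Dimension bookkeeping for multi-head attention: the model dimension is
// split evenly across heads, and concatenating the heads restores it.
const dModel = 4096;   // embedding dimension
const numHeads = 32;   // attention heads
const dHead = dModel / numHeads; // per-head Query/Key/Value dimension

console.log(dHead); // 128: each head attends in a 128-dimensional subspace

// After attention, the 32 per-head outputs (each of size 128) are
// concatenated back to 4096 and passed through a final linear projection.
const concatenated = numHeads * dHead; // 4096, same as dModel
```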
Beyond Attention: The Full Block
Each transformer block contains:
- Multi-head self-attention — contextual mixing between tokens
- Layer normalization — stabilizes training
- Feed-forward network (FFN) — two linear layers with a nonlinearity (typically GELU), applied to each position independently
- Residual connections — skip connections around each sub-layer, enabling gradient flow in deep networks
A modern LLM stacks 32–128 of these blocks, with each block refining the representations produced by the previous one.
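The wiring of those four components can be sketched with the sub-layers stubbed out as identity functions; the point is the residual structure, not the learned weights. Note this sketch uses the pre-norm arrangement (normalization before each sub-layer), which most modern LLMs use, whereas the original paper placed it after:

```typescript
// Structural sketch of one pre-norm transformer block. The sub-layers are
// identity stubs standing in for the real learned components.
type Tensor = number[][]; // [tokens][features]

const layerNorm = (x: Tensor): Tensor => x;      // would normalize each token vector
const selfAttention = (x: Tensor): Tensor => x;  // would mix context across tokens
const feedForward = (x: Tensor): Tensor => x;    // would apply a per-token MLP (GELU)

const add = (a: Tensor, b: Tensor): Tensor =>
  a.map((row, i) => row.map((v, j) => v + b[i][j]));

function transformerBlock(x: Tensor): Tensor {
  // residual connection around attention: x + Attn(LN(x))
  x = add(x, selfAttention(layerNorm(x)));
  // residual connection around the FFN: x + FFN(LN(x))
  x = add(x, feedForward(layerNorm(x)));
  return x;
}
```

The residual additions are what let gradients flow cleanly through 100+ stacked blocks.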
Encoder vs Decoder Architecture
The original transformer had both an encoder and decoder. Over time, the field discovered that different configurations excel at different tasks.
Encoder-Only (BERT, RoBERTa)
The encoder processes the full input bidirectionally — every token can attend to every other token in both directions. This makes encoder-only models excellent at understanding tasks:
- Text classification and sentiment analysis
- Named entity recognition
- Semantic search and embeddings
- Question answering (extractive)
BERT, released by Google in 2018, is the canonical encoder-only model. It is pre-trained using masked language modeling (predicting randomly masked tokens) and next sentence prediction.
Decoder-Only (GPT, Claude, Llama)
The decoder processes tokens left-to-right with causal masking — each token can only attend to tokens that came before it. This makes decoder-only models natural at generation tasks:
- Text generation and creative writing
- Code completion
- Conversational AI (chatbots)
- Instruction following
Every major LLM today — GPT-4o, Claude, Gemini, Llama — uses a decoder-only architecture. The field largely converged on this design because generation capabilities proved more versatile and scalable than bidirectional understanding.
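Causal masking itself is simple to picture: a lower-triangular matrix of allowed attention pairs. A minimal sketch:

```typescript
// Causal attention mask for an n-token sequence: entry [i][j] is true when
// token i may attend to token j, i.e. j <= i (no peeking at future tokens).
function causalMask(n: number): boolean[][] {
  return Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => j <= i)
  );
}

// In practice the mask is applied by setting disallowed attention scores to
// -Infinity before the softmax, so their weights become exactly zero.
const mask = causalMask(4);
console.log(mask[0]); // [true, false, false, false]: token 0 sees only itself
console.log(mask[3]); // [true, true, true, true]: the last token sees everything
```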
Encoder-Decoder (T5, BART)
The encoder processes the full input bidirectionally, then the decoder generates output autoregressively while cross-attending to the encoder's representations. This architecture excels at sequence-to-sequence tasks:
- Machine translation
- Text summarization
- Structured data extraction
Google's T5 treats every NLP task as a text-to-text problem, making it remarkably flexible. However, for most modern applications, decoder-only models have proven sufficient and simpler to scale.
From GPT-1 to GPT-4o: The Evolution
The history of LLMs is largely a story of scaling — more parameters, more data, more compute — with occasional architectural innovations.
| Year | Model | Parameters | Key Innovation |
|---|---|---|---|
| 2018 | GPT-1 | 117M | Pre-training + fine-tuning paradigm |
| 2019 | GPT-2 | 1.5B | Zero-shot task performance, emergent abilities |
| 2020 | GPT-3 | 175B | Few-shot learning via in-context examples |
| 2022 | ChatGPT | ~175B | RLHF alignment, conversational interface |
| 2023 | GPT-4 | ~1.8T (MoE) | Multimodal (text + vision), massive quality jump |
| 2024 | GPT-4o | ~200B (MoE) | Native multimodal (text, vision, audio), faster and cheaper |
| 2024 | Claude 3 | Unknown | 200K context, strong instruction following |
| 2025 | GPT-5 | Unknown | Multi-step planning, long-term memory, agentic workflows |
| 2025 | Claude 4 | Unknown | Extended thinking, advanced coding, 1M context |
| 2025 | Gemini 2.5 | Unknown | 1M context, native multimodal, reasoning traces |
The key transitions:
- GPT-1 → GPT-3: Proved that scale alone produces emergent capabilities (few-shot learning, reasoning)
- GPT-3 → ChatGPT: RLHF alignment made models conversational and instruction-following
- GPT-4 → GPT-4o: Multimodal inputs became native rather than bolted-on
- 2025 models: Focus shifted to reasoning depth, agentic capabilities, and massive context windows
The Current Model Landscape
As of March 2026, the LLM market has three tiers: frontier proprietary models, mid-tier proprietary models, and open-source/open-weight models.
| Model | Provider | Context Window | Multimodal | Strengths | Pricing (per 1M tokens) |
|---|---|---|---|---|---|
| GPT-5.x | OpenAI | 400K+ | Text, image, audio | Code, function calling, agentic tasks | $2–15 input, $8–60 output |
| Claude Opus 4.6 | Anthropic | 200K | Text, image | Reasoning, long-context, instruction following | $15 input, $75 output |
| Claude Sonnet 4.6 | Anthropic | 200K | Text, image | Best balance of speed/quality/cost | $3 input, $15 output |
| Claude Haiku 4.5 | Anthropic | 200K | Text, image | Fast, cheap, good for structured tasks | $0.80 input, $4 output |
| Gemini 2.5 Pro | Google | 1M+ | Text, image, audio, video | Massive context, native multimodal | $1.25–2.50 input, $10 output |
| Gemini 2.5 Flash | Google | 1M | Text, image, audio, video | Fast and cost-effective | $0.15 input, $3.50 output |
| Llama 4 Scout | Meta | 10M (MoE) | Text, image | Huge context, open-weight | Free (self-hosted) |
| Llama 4 Maverick | Meta | 128K (MoE) | Text, image | Strong general performance | Free (self-hosted) |
| DeepSeek-V3.2 | DeepSeek | 128K | Text | Reasoning, code, cost-efficient | $0.27 input, $1.10 output |
| Mistral Large | Mistral | 128K | Text, image | European alternative, strong multilingual | $2 input, $6 output |
| Qwen 2.5 | Alibaba | 128K | Text, image | Multilingual, coding | Free (self-hosted) |
The pricing landscape has compressed dramatically. What cost $60/M tokens in 2023 (GPT-4) now costs $3/M tokens at comparable quality (Claude Sonnet, Gemini Flash). For cost-sensitive workloads, open-source models running on your own hardware bring that cost close to zero.
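A quick back-of-the-envelope helper makes these price differences concrete. Prices are taken from the table above and change frequently, so treat the numbers as illustrative:

```typescript
// Estimate API cost from token counts and per-million-token prices.
function costUSD(inputTokens: number, outputTokens: number,
                 inputPerM: number, outputPerM: number): number {
  return (inputTokens / 1_000_000) * inputPerM +
         (outputTokens / 1_000_000) * outputPerM;
}

// Example: 10,000 requests, each ~2,000 input and ~500 output tokens,
// at Claude Sonnet's $3 input / $15 output per million tokens:
const total = costUSD(10_000 * 2_000, 10_000 * 500, 3, 15);
console.log(total); // 135: $60 of input + $75 of output
```

Running the same workload at 2023-era GPT-4 prices would have cost an order of magnitude more.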
Multimodal Transformers
Modern transformers don't just process text. The architecture has proven remarkably adaptable to other modalities.
Vision Transformers (ViT)
The Vision Transformer (ViT), introduced by Google in 2020, applies the transformer architecture to images by splitting an image into fixed-size patches (e.g., 16×16 pixels), flattening each patch into a vector, and treating the sequence of patches exactly like a sequence of word tokens. Self-attention then captures relationships between patches, enabling the model to understand spatial relationships, object boundaries, and scene composition.
CLIP and Contrastive Learning
OpenAI's CLIP (Contrastive Language-Image Pre-training) learns to align text and image representations in a shared embedding space. Given an image and a text description, CLIP can determine how well they match — enabling zero-shot image classification, image search, and cross-modal retrieval without task-specific training.
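The retrieval side of this reduces to cosine similarity in the shared space. A toy sketch with made-up embedding vectors; a real pipeline would obtain them from CLIP's image and text encoders:

```typescript
// Zero-shot classification, CLIP-style: embed the image once, embed one
// text prompt per candidate label, pick the label whose embedding is closest.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((acc, x, i) => acc + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((acc, x) => acc + x * x, 0));
  return dot / (norm(a) * norm(b));
}

const imageEmbedding = [0.9, 0.1, 0.2]; // hypothetical image encoder output
const labels: Record<string, number[]> = {
  "a photo of a cat": [0.8, 0.2, 0.1], // hypothetical text encoder outputs
  "a photo of a dog": [0.1, 0.9, 0.3],
};

const best = Object.entries(labels)
  .map(([label, emb]) => [label, cosineSimilarity(imageEmbedding, emb)] as const)
  .sort((a, b) => b[1] - a[1])[0][0];
console.log(best); // "a photo of a cat"
```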
Native Multimodal Models
The latest generation of models processes multiple modalities natively rather than stitching together separate encoders:
- GPT-4o / GPT-5 — processes text, images, and audio in a single model, generating text and audio output
- Claude 4.x — accepts images and text, excels at analyzing code screenshots, diagrams, and document images
- Gemini 2.5 / 3.x — natively multimodal from the ground up, processing text, images, audio, and video in unified conversation
- Qwen-VL — open-source vision-language model capable of operating graphical interfaces and recognizing UI elements
The trend is clear: separate single-modality models are giving way to unified architectures that understand the world through multiple senses simultaneously, much like humans do.
Instruction Tuning and RLHF
A raw, pre-trained transformer is a powerful text predictor but a terrible assistant. It has absorbed the statistical patterns of its training data, but it doesn't know how to follow instructions, maintain a conversation, or refuse harmful requests. Three post-training techniques bridge this gap.
Supervised Fine-Tuning (SFT)
The first step is supervised fine-tuning on a curated dataset of (instruction, response) pairs. Human annotators or AI systems write high-quality examples of desired behavior. The model learns basic formatting, instruction following, and conversational patterns. Think of SFT as teaching the model "this is what a helpful response looks like."
Reinforcement Learning from Human Feedback (RLHF)
RLHF refines the model further using human preferences. The process:
- Generate multiple responses to the same prompt
- Human raters rank the responses from best to worst
- Train a reward model that predicts human preferences
- Use reinforcement learning (PPO algorithm) to optimize the LLM to produce outputs the reward model rates highly
RLHF is how ChatGPT became conversational, helpful, and (mostly) safe. It's effective but complex — training a separate reward model and running RL optimization is unstable and resource-intensive.
Direct Preference Optimization (DPO)
DPO simplifies alignment by eliminating the reward model entirely. Instead of training a separate model, DPO directly optimizes the language model using pairs of preferred/dispreferred responses. It achieves comparable results to RLHF with substantially reduced complexity and is now used by several major LLM providers.
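Concretely, the DPO objective from the original DPO paper trains the policy directly on preference pairs. Writing π_θ for the model being trained, π_ref for a frozen reference copy, y_w and y_l for the preferred and dispreferred responses, σ for the logistic function, and β for a coefficient controlling how far the policy may drift from the reference:

L_DPO = −E_(x, y_w, y_l) [ log σ( β · log(π_θ(y_w | x) / π_ref(y_w | x)) − β · log(π_θ(y_l | x) / π_ref(y_l | x)) ) ]

Intuitively, the loss pushes up the probability of the preferred response relative to the reference model while pushing down the dispreferred one, with no reward model or RL loop involved.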
Constitutional AI (CAI)
Anthropic's Constitutional AI takes a different approach: instead of relying entirely on human feedback, the model critiques and revises its own outputs based on a set of predefined principles (a "constitution"). The model generates a response, evaluates it against principles like helpfulness and harmlessness, and rewrites it. This AI-generated feedback is then used for RLHF-style training — a technique called RLAIF (RL from AI Feedback). CAI is central to how Claude models are trained.
Open-Source Models: Llama, Mistral, Qwen, DeepSeek
2025 was the year open-source LLMs closed the gap with proprietary models. In 2026, they're on par — or better — in many domains.
Meta Llama
Meta's Llama family is the most influential open-source LLM series. Llama 3 (2024) came in 8B, 70B, and 405B parameter sizes with a 128K context window. Llama 4 (2025) introduced Mixture of Experts — Scout with 17B active parameters from 109B total and an unprecedented 10M token context window, and Maverick optimized for general-purpose tasks. Licensed under Meta's community license (free for most commercial use under 700M monthly active users).
Mistral AI
The French AI lab Mistral made its name by punching well above its weight class. Mistral 7B outperformed Llama 2 13B on most benchmarks. Mixtral 8x7B introduced an accessible MoE architecture. Mistral Small 3 (early 2025) ships under Apache 2.0 — fully permissive for commercial use. Mistral Large competes with frontier models at lower cost.
Qwen (Alibaba)
Qwen models from Alibaba are among the most versatile open LLMs, especially for multilingual workloads. Qwen 2.5 covers sizes from 0.5B to 72B with strong coding and math capabilities. Qwen 3 (late 2025) expanded to 110B+ parameters. The Qwen3-Coder variant has shown particular strength in code generation, and Qwen-VL handles vision-language tasks including UI interaction.
DeepSeek
DeepSeek gained massive attention in early 2025 when its R1 reasoning model demonstrated ChatGPT-level performance at dramatically lower training costs — the "DeepSeek moment." DeepSeek-V3 and V3.2 built on this with strong code generation and reasoning capabilities. The models use MoE with aggressive cost optimization, making them some of the most cost-efficient options for API usage ($0.27/M input tokens).
Where to Find Models
Hugging Face is the central hub for open-source model weights, with over 1 million models available. Most models are published in multiple quantization formats (GGUF for llama.cpp/Ollama, GPTQ for GPU inference, AWQ for optimized GPU inference).
Which Model for Which Task?
Choosing the right model depends on your task requirements, budget, and infrastructure constraints.
| Task | Recommended Model | Why |
|---|---|---|
| Code generation & debugging | Claude Sonnet 4.6, GPT-5.x | Strongest coding benchmarks, tool use |
| Long document analysis | Gemini 2.5 Pro (1M ctx), Claude (200K ctx) | Massive context windows |
| Creative writing | Claude Opus 4.6, GPT-5.x | Nuanced, instruction-following |
| Data extraction & structured output | GPT-4o, Claude Sonnet | Reliable JSON/function calling |
| Multilingual content | Qwen 2.5, Gemini | Broad language coverage |
| Cost-sensitive high-volume | Gemini Flash, Claude Haiku, GPT-4o-mini | Sub-$1/M token pricing |
| Privacy-sensitive / offline | Llama 3 70B, Mistral Large via Ollama | Runs entirely on your hardware |
| Reasoning-heavy tasks | DeepSeek R1, Claude Opus 4.6 | Chain-of-thought, extended thinking |
| Multimodal (image + text) | GPT-4o, Gemini 2.5 Pro | Native vision understanding |
| Embeddings & semantic search | text-embedding-3-large, voyage-3 | Purpose-built for vector storage |
| Agentic workflows | Claude Sonnet 4.6, GPT-5.x | Tool use, multi-step planning |
Calling the OpenAI API (TypeScript)
import OpenAI from 'openai'
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'You are a helpful coding assistant.' },
{ role: 'user', content: 'Write a TypeScript function to debounce API calls.' },
],
temperature: 0.7,
max_tokens: 1024,
})
console.log(response.choices[0].message.content)
Calling the Anthropic API (TypeScript)
import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const message = await anthropic.messages.create({
model: 'claude-sonnet-4-6-20250514',
max_tokens: 1024,
messages: [
{
role: 'user',
content: 'Explain the transformer attention mechanism in 3 sentences.',
},
],
})
console.log(message.content[0].text)
Running Models Locally
You don't need an API key to use powerful LLMs. Open-source models can run entirely on your hardware — ideal for privacy, cost control, and offline use. For a deeper dive, see our complete guide to local LLM inference tools and choosing the right Mac Mini for local LLMs.
Ollama — The Easiest Path
Ollama is the "Docker for LLMs" — a single binary that downloads, quantizes, and runs models locally with automatic GPU detection and an OpenAI-compatible API.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain transformers in simple terms"
# Use as an API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'
vLLM — Production-Grade Serving
vLLM is optimized for throughput with its PagedAttention algorithm, reducing memory fragmentation by 40%+ and enabling large batch sizes. Benchmarks show vLLM hitting 793 tokens/second versus Ollama's 41 tokens/second under concurrent load. Use vLLM when serving multiple users.
llama.cpp — Maximum Control
llama.cpp is the C++ inference engine that Ollama builds upon. Use it directly when you need custom compilation flags, hardware-specific optimizations, or the absolute lowest-level control over inference.
Hardware Requirements
| Model Size | Quantization | VRAM Required | Example GPU |
|---|---|---|---|
| 7–8B | Q4_K_M | 5–7 GB | RTX 3060 12GB, M1/M2 16GB |
| 13B | Q4_K_M | 9–11 GB | RTX 3080, M1 Pro 16GB |
| 32–34B | Q4_K_M | 20–24 GB | RTX 3090/4090, M2 Max 32GB |
| 70B | Q4_K_M | 38–42 GB | 2× RTX 3090, M2 Ultra 64GB |
| 70B | Q8_0 | 70+ GB | 2× RTX 4090, M2 Ultra 128GB |
For Apple Silicon users, the unified memory architecture makes running larger models practical. An M2 Max with 32GB can comfortably run 32B models — see our Mac Mini guide for specific recommendations.
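The VRAM column in the table follows a simple heuristic: weights take parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and activations. A rough sketch, where the 20% overhead factor is an assumption rather than a fixed rule:

```typescript
// Rough VRAM estimate: weight bytes plus ~20% for KV cache and activations.
// A coarse rule of thumb, not an exact formula.
function estimateVramGB(paramsBillions: number, bitsPerWeight: number): number {
  const weightsGB = (paramsBillions * 1e9 * bitsPerWeight) / 8 / 1e9;
  return weightsGB * 1.2; // +20% overhead (assumed)
}

console.log(estimateVramGB(8, 4).toFixed(1));  // "4.8": consistent with the 5–7 GB row
console.log(estimateVramGB(70, 4).toFixed(1)); // "42.0": consistent with the 38–42 GB row
```

Q4_K_M quantization averages slightly more than 4 bits per weight, which is why the table's ranges sit a bit above these estimates.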
Fine-Tuning and Adaptation
When prompting alone doesn't get you the output quality you need, you have three escalation paths.
Prompt Engineering
Start here. Techniques like chain-of-thought prompting, few-shot examples, structured output formatting, and system prompts can dramatically improve results with zero training cost. Many tasks that seem to require fine-tuning can be solved with better prompts.
Retrieval-Augmented Generation (RAG)
Inject relevant context from your own data into the model's prompt at inference time. RAG is ideal when:
- Your data changes frequently (product catalogs, documentation)
- You need verifiable, source-attributed answers
- The knowledge doesn't fit in a fine-tuning dataset
- You want to use the model's general capabilities on your specific data
RAG pairs well with vector databases and embeddings for semantic retrieval.
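The core retrieval loop is small enough to sketch. The embeddings below are toy vectors standing in for the output of a real embedding model; the shape of the pipeline (embed, rank by similarity, inject the top matches into the prompt) is the point:

```typescript
// Minimal RAG sketch: rank documents by embedding similarity to the query,
// then build a prompt containing the top-k matches as context.
type Doc = { text: string; embedding: number[] };

const cosine = (a: number[], b: number[]): number => {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

function buildPrompt(query: string, queryEmbedding: number[],
                     docs: Doc[], topK = 2): string {
  const context = [...docs]
    .sort((a, b) =>
      cosine(queryEmbedding, b.embedding) - cosine(queryEmbedding, a.embedding))
    .slice(0, topK)
    .map((d) => d.text)
    .join("\n");
  return `Answer using only this context:\n${context}\n\nQuestion: ${query}`;
}
```

In production, the sort over all documents is replaced by an approximate nearest-neighbor lookup in a vector database.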
Fine-Tuning with LoRA / QLoRA
When you need the model to adopt a specific style, format, or domain expertise that prompting can't achieve:
- LoRA (Low-Rank Adaptation) — freezes the base model weights and trains small adapter matrices (typically 0.1–1% of total parameters). Produces lightweight adapters that can be swapped in and out.
- QLoRA — combines LoRA with 4-bit quantization, enabling fine-tuning of a 65B parameter model on a single 48GB GPU. Made fine-tuning accessible to individual developers.
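The "low-rank" in LoRA is literal: for a frozen weight matrix W of shape d × k, LoRA trains an additive update factored into two small matrices:

W′ = W + ΔW = W + B·A, where B is d × r, A is r × k, and the rank r ≪ min(d, k)

For example, a 4096 × 4096 attention projection with r = 8 gets 4096·8 + 8·4096 = 65,536 trainable adapter parameters against roughly 16.8M frozen ones, about 0.4%, which is where the 0.1–1% figure comes from. At inference time B·A can be merged into W, so a merged adapter adds no latency.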
When to fine-tune vs prompt engineer vs RAG:
| Situation | Approach |
|---|---|
| Need specific knowledge or facts | RAG |
| Need specific output format or style | Fine-tune |
| Need general task improvement | Better prompts |
| Data changes frequently | RAG |
| Need maximum quality on narrow domain | Fine-tune + RAG |
| Budget is limited | Prompt engineering first |
The Future: Mixture of Experts, State Space Models
The transformer architecture continues to evolve, with two major trends reshaping the landscape.
Mixture of Experts (MoE)
MoE architectures use a routing network to selectively activate only a subset of the model's parameters for each input token. For example, Mixtral 8x7B has 47B total parameters but only activates ~13B per token — achieving the quality of a much larger dense model at a fraction of the compute cost.
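The routing step is easy to sketch: score every expert, keep the top k, and renormalize their weights. The gate scores below are toy inputs; real routers are small learned networks:

```typescript
// Top-k expert routing: select the k highest-scoring experts and combine
// their outputs with softmax-renormalized weights.
function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function route(gateScores: number[], k: number): { expert: number; weight: number }[] {
  const topK = gateScores
    .map((score, expert) => ({ expert, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  const weights = softmax(topK.map((e) => e.score)); // renormalize over top-k only
  return topK.map((e, i) => ({ expert: e.expert, weight: weights[i] }));
}

// 8 experts, activate 2 per token (Mixtral-style):
const chosen = route([0.1, 2.3, -0.5, 1.8, 0.0, 0.4, -1.2, 0.9], 2);
console.log(chosen.map((c) => c.expert)); // [1, 3]: only 2 of the 8 experts run for this token
```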
MoE has gone mainstream in 2025–2026:
- GPT-4 was widely reported to use MoE (unconfirmed by OpenAI)
- Llama 4 Scout uses MoE with 17B active from 109B total
- DeepSeek-V3 uses MoE for cost-efficient reasoning
- Mixtral made the technique accessible to the open-source community
The trade-off: MoE models require more total memory (all experts must be loaded) but use less compute per token, making them faster and cheaper at inference time.
State Space Models (SSMs)
State Space Models like Mamba offer an alternative to the attention mechanism. While self-attention has O(n²) complexity with sequence length (every token attends to every other token), SSMs achieve O(n) complexity by maintaining a compressed state that's updated as each token is processed — similar in spirit to RNNs but without the vanishing gradient problem.
SSMs excel at very long sequences where quadratic attention becomes prohibitively expensive. However, pure SSMs have shown slightly weaker performance on tasks requiring precise recall of specific tokens from the input.
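The O(n) claim is easiest to see in a toy scalar recurrence: each token triggers one constant-time update to a fixed-size state, no matter how long the sequence already is. The coefficients a and b here stand in for learned SSM parameters:

```typescript
// Toy scalar state-space recurrence: one constant-time state update per
// token, versus attention's comparison against every previous token.
function ssmScan(inputs: number[], a: number, b: number): number[] {
  let state = 0;
  return inputs.map((x) => {
    state = a * state + b * x; // h_t = a·h_{t-1} + b·x_t
    return state;
  });
}

console.log(ssmScan([1, 0, 0, 0], 0.5, 1)); // [1, 0.5, 0.25, 0.125]: an impulse decaying through the state
```

The compression is also the weakness: the state must summarize everything seen so far, which is why pure SSMs can struggle with exact recall of individual earlier tokens.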
Hybrid Architectures
The most promising direction combines both: Jamba (AI21) interleaves transformer attention layers with Mamba SSM layers, getting the best of both architectures — precise attention for short-range dependencies and efficient SSM processing for long-range context. Expect more hybrid architectures in 2026 and beyond.
What's Next?
Several trends are converging:
- Agentic AI — models that plan multi-step tasks, use tools, and act autonomously
- Reasoning models — extended chain-of-thought with verification (DeepSeek R1, OpenAI o1/o3)
- Smaller, smarter models — distillation and architecture improvements making 8B models match 2024's 70B
- Multimodal everything — text, image, audio, video, and code in unified architectures
- On-device inference — running capable models on phones and laptops
The transformer architecture that started as a machine translation improvement in 2017 has become the foundation of the most significant technology shift since the internet. Whether you're building with APIs or running models on your own hardware, understanding how transformers work gives you the foundation to navigate this rapidly evolving landscape.
