
LLM Model Names Decoded: A Developer's Guide to Parameters, Quantization & Formats

TL;DR: "B" = billions of parameters. "IT" = instruction tuned. "Q4_K_M" = 4-bit quantization, a common default. "GGUF" = the format for Ollama and local tools. "MoE" = only a fraction of parameters activate per token. This guide decodes every component of LLM model names, explains quantization formats and file types, and points you to the best resources for researching which model fits your hardware and use case.

If you've ever stared at a Hugging Face model page and seen something like unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF and wondered what any of that means — this guide is for you.

The open-weight model ecosystem has exploded. Gemma 4, Qwen 3.5, Llama 4, DeepSeek, Mistral — every family ships dozens of variants across different sizes, architectures, quantization levels, and file formats. Picking the right one for your hardware and use case shouldn't require a PhD.

I wrote this as a companion to my local LLM inference tools guide, which covers how to run models. This guide explains what all those cryptic suffixes mean and points you toward the best resources for researching which model fits your setup.

Anatomy of a Model Name

Let's decode a real model name, piece by piece.

Take bartowski/Qwen3.5-32B-Instruct-GGUF-Q4_K_M:

| Component | Value | Meaning |
| --- | --- | --- |
| Organization | bartowski | Who published this variant (community quantizer) |
| Family | Qwen3.5 | Model family and version (Alibaba's Qwen, generation 3.5) |
| Size | 32B | 32 billion parameters |
| Training | Instruct | Instruction-tuned (follows prompts) |
| Format | GGUF | File format (for Ollama, LM Studio, llama.cpp) |
| Quantization | Q4_K_M | 4-bit precision, K-quant method, medium variant |

Here's another: google/gemma-4-26B-A4B-it

| Component | Value | Meaning |
| --- | --- | --- |
| Organization | google | Official release from Google |
| Family | gemma-4 | Gemma generation 4 |
| Size | 26B-A4B | 26B total params, 4B active (Mixture of Experts) |
| Training | it | Instruction tuned |

The general pattern: [Org/] Family-Version-Size [-Active] -Training [-Format] [-Quantization]

Not every model follows this exactly — naming is more convention than standard. But once you know the components, you can decode anything.
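Once you know the components, decoding can even be mechanized. Here's a rough Python sketch — a heuristic regex, not a complete parser, since real-world names deviate constantly:

```python
import re

# Heuristic pattern for common model-name layouts, e.g.
# "bartowski/Qwen3.5-32B-Instruct-GGUF-Q4_K_M".
# Treat this as an illustration, not a robust parser.
NAME_RE = re.compile(
    r"^(?:(?P<org>[^/]+)/)?"                                  # optional organization
    r"(?P<family>.+?)"                                        # family + version
    r"-(?P<size>\d+(?:\.\d+)?[BM])"                           # parameter count (32B, 278M)
    r"(?:-A(?P<active>\d+B))?"                                # optional active params (MoE)
    r"(?:-(?P<training>Instruct|instruct|it|IT|Chat|chat|base|Base))?"
    r"(?:-(?P<format>GGUF|MLX|AWQ|GPTQ|EXL2))?"
    r"(?:-(?P<quant>Q\d_[K0](?:_[SML])?))?$"
)

def decode(name: str) -> dict:
    """Return the name's components, or an empty dict if it doesn't match."""
    m = NAME_RE.match(name)
    return m.groupdict() if m else {}
```

For example, `decode("bartowski/Qwen3.5-32B-Instruct-GGUF-Q4_K_M")` splits out the organization, family, size, training variant, format, and quantization level, and `decode("google/gemma-4-26B-A4B-it")` additionally captures the active-parameter suffix.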

Parameters: What the Numbers Mean

The "B" in model names stands for billions of parameters — the trainable numerical weights that a neural network learns during training. More parameters generally means more knowledge capacity, but also more memory required.

Size Tiers

| Tier | Parameter Range | RAM Needed (Q4_K_M) | Best For |
| --- | --- | --- | --- |
| Tiny | 1-3B | 2-3 GB | Edge devices, quick tasks, mobile |
| Small | 4-9B | 3-6 GB | General chat, summarization, simple coding |
| Medium | 13-14B | 8-10 GB | Strong coding, reasoning, creative writing |
| Large | 27-32B | 18-22 GB | Complex reasoning, nuanced writing |
| Extra Large | 70B+ | 40+ GB | Near-frontier quality, research |

The rule of thumb for Q4_K_M GGUF: take the parameter count in billions, multiply by roughly 0.6, and that's your approximate file size in GB. A 7B model is ~4GB, a 32B is ~19GB, a 70B is ~40GB.
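In code, the rule of thumb looks like this. The ~0.6 multiplier falls out of Q4_K_M's effective bit-width, roughly 4.85 bits per weight once scales and promoted tensors are counted; treat the exact figure as an approximation:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Rough GGUF file-size estimate: parameters x bits / 8, in GB.

    The default 4.85 bits/weight approximates Q4_K_M, which works out
    to about 0.6 GB per billion parameters.
    """
    return params_billion * bits_per_weight / 8

# 7B  -> ~4.2 GB
# 32B -> ~19.4 GB
# 70B -> ~42 GB
```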

You'll also see "M" for millions — 278M means 278 million parameters. These are tiny models for embedding, classification, or on-device use.

Bigger Isn't Always Better

A well-trained 14B model frequently outperforms a mediocre 70B. Training data quality, architecture choices, and fine-tuning matter as much as raw parameter count. Phi-4-reasoning at 14B beats DeepSeek-R1 (671B total) on some math benchmarks. Qwen2.5-Coder at 14B scores ~85% on HumanEval, competitive with models 5x its size.

The best way to evaluate this is hands-on experimentation. Browse the Ollama model library, check Hugging Face trending models, or explore what's popular on OpenRouter — then try a few models at your hardware tier and see what works for your workflow.

Further reading: AI Model Parameters Explained · LLM Model Sizes Guide · Phi-4 Reasoning Technical Report

Training Variants: Base vs Instruct vs Chat

When you see -base, -instruct, -it, or -chat in a model name, it tells you how the model was fine-tuned after initial pretraining.

Base (Pretrained)

  • Trained on massive text corpora via next-token prediction
  • Completes text patterns but doesn't follow instructions reliably
  • Like a student who's read every book but hasn't learned to answer exam questions
  • When to use: Fine-tuning your own model, research, text completion

Instruct / IT (Instruction Tuned)

  • Fine-tuned on instruction-response pairs (supervised fine-tuning)
  • Follows user prompts reliably: "Summarize this," "Write a function that..."
  • The standard variant for most use cases
  • When to use: Coding, Q&A, summarization, analysis — virtually everything

Chat

  • Further optimized for multi-turn conversations with RLHF or DPO
  • Better at maintaining context across a conversation
  • When to use: Chatbot applications, interactive assistants

Other Training Suffixes

| Suffix | Meaning |
| --- | --- |
| -DPO | Trained with Direct Preference Optimization (alignment technique) |
| -RLHF | Trained with Reinforcement Learning from Human Feedback |
| -reasoning / -thinking | Optimized for chain-of-thought reasoning |
| -vision / -VL | Supports image input (vision-language) |
| -coder | Fine-tuned specifically for code generation |

For general use, always pick the instruct/IT variant. Base models are for researchers and fine-tuners. If you're running a model in Ollama or LM Studio, you want instruct.

Further reading: Base vs Instruct vs Chat Models (Medium) · Foundation vs Instruct vs Thinking Models · Choosing the Right Model (BentoML)

Quantization Demystified

Quantization reduces the numerical precision of model weights — storing each weight in fewer bits. This shrinks file size and speeds up inference at the cost of some accuracy.

Precision Formats

Full-precision models store each weight as a 16-bit or 32-bit floating point number. Quantization compresses these down:

| Format | Bits per Weight | Description | Typical Use |
| --- | --- | --- | --- |
| FP32 | 32 | Full precision, gold standard | Training reference |
| BF16 | 16 | Brain Float 16 (same range as FP32, lower precision) | Default for LLM training |
| FP16 | 16 | Half precision (narrower range than BF16) | GPU inference |
| FP8 | 8 | 8-bit float | Cutting-edge training/inference |
| INT8 | 8 | 8-bit integer, fixed-point | Post-training quantization |
| INT4 / FP4 | 4 | 4-bit, aggressive compression | Local inference on constrained hardware |

When you see BF16 or FP16 in a model name, it means the weights are stored at that precision — no quantization applied. These are the highest-quality downloads but also the largest files.

GGUF Quantization Levels

GGUF files use a naming scheme: Q[bits]_[method]_[variant] — for example, Q4_K_M.

  • Q = quantized
  • Number = bits per weight (2, 3, 4, 5, 6, 8)
  • K = K-quant method (smarter bit allocation across layers)
  • S / M / L = Small / Medium / Large variant (how many tensors get extra precision)
| Level | Bits | Size (7B model) | Quality | Recommendation |
| --- | --- | --- | --- | --- |
| Q2_K | 2 | ~2.7 GB | Poor — significant loss | Emergency only |
| Q3_K_S | 3 | ~2.9 GB | Fair — noticeable degradation | Very constrained hardware |
| Q3_K_M | 3 | ~3.1 GB | Fair | Tight budgets |
| Q4_K_S | 4 | ~3.6 GB | Good | Budget hardware |
| Q4_K_M | 4 | ~3.8 GB | Good — 92% quality retention | The mainstream default |
| Q5_K_S | 5 | ~4.6 GB | Very good | Between Q4 and Q6 |
| Q5_K_M | 5 | ~4.8 GB | Very good — near-imperceptible loss | When you have extra RAM |
| Q6_K | 6 | ~5.5 GB | Excellent | Quality-sensitive tasks |
| Q8_0 | 8 | ~7 GB | Near-lossless | When VRAM isn't a concern |
| F16 | 16 | ~14 GB | Perfect | Maximum quality baseline |

The sweet spot for most users is Q4_K_M. It's the default quantization in Ollama, retains ~92% of the original model's quality, and cuts file size by roughly 75% compared to FP16.
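Parsing these tags is mechanical. Here's a small sketch that handles the Q[bits]_[method]_[variant] shape shown above (i-quant tags like IQ4_XS would need separate handling):

```python
def parse_quant(tag: str) -> dict:
    """Split a GGUF quantization tag like 'Q4_K_M' into its parts.

    Sketch only: handles Q[bits]_[method]_[variant] tags, where method
    is 'K' (K-quant) or '0' (legacy round-to-nearest) and the variant
    is S, M, or L (absent for Q6_K and Q8_0).
    """
    parts = tag.split("_")
    bits = int(parts[0][1:])                         # "Q4" -> 4
    method = parts[1] if len(parts) > 1 else None    # "K" or "0"
    variant = parts[2] if len(parts) > 2 else None   # "S" / "M" / "L"
    return {"bits": bits, "method": method, "variant": variant}
```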

What K-Quant Actually Does

K-quants use a two-level quantization scheme. Weights are grouped into 32-weight blocks, packed into 256-weight "super-blocks." Per-block scale factors are computed, then those scales are quantized again (double quantization). This preserves more information than naive bit reduction.
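The first level of that scheme, per-block scaling, can be illustrated with a toy example in plain Python (single-level only; real K-quants pack blocks into super-blocks and quantize the scales too):

```python
def quantize_block(weights, bits=4):
    """Toy symmetric per-block quantization: one float scale per block,
    small signed-integer codes per weight. Real K-quants additionally
    quantize the scales inside 256-weight super-blocks."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

# One (shortened) block of small weights, as found in a typical layer
block = [0.013, -0.041, 0.002, 0.027, -0.008, 0.035, -0.019, 0.004]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
# Per-weight error is bounded by half a quantization step (scale / 2)
error = max(abs(r - w) for r, w in zip(restored, block))
```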

The S/M/L suffix controls which layers get extra precision:

  • S (Small): All tensors at the base bit-width — smallest file
  • M (Medium): Some attention and feed-forward tensors get higher bit-width — better quality, slightly larger
  • L (Large): More tensors at higher bit-width — best quality, largest file

For example, Q4_K_M stores most tensors at 4-bit but promotes half of the attention and feed-forward weights to 6-bit.

I-Quants (Importance Matrix)

A newer family of quantization (IQ2_M, IQ3_M, IQ4_XS) uses importance matrices to identify and protect critical weights during quantization. IQ4_XS can compress more aggressively than Q4_K_M with comparable quality. You'll see these from quantizers like unsloth.

GPU-Native Quantization Methods

GGUF isn't the only game in town. If you have an NVIDIA GPU, these formats run faster:

| Format | Creator | Key Advantage | Hardware |
| --- | --- | --- | --- |
| AWQ | MIT / NVIDIA | Activation-aware, ~95% quality at 4-bit, fastest with Marlin kernel | NVIDIA GPU only |
| GPTQ | Frantar et al. | First practical LLM quantization, wide tool support | NVIDIA GPU only |
| EXL2 | turboderp | Per-layer mixed bit-widths (2-8 bit), fastest interactive inference | NVIDIA GPU only |

These methods produce files stored as safetensors (not GGUF) and run through tools like vLLM, ExLlamaV2, or HuggingFace Transformers. They're GPU-only — no CPU fallback.

When to use what:

  • On CPU or mixed CPU/GPU → GGUF (Q4_K_M default)
  • On NVIDIA GPU, maximum throughput → AWQ with Marlin kernel
  • On NVIDIA GPU, maximum quality-per-byte → EXL2

Further reading: GGUF Quantization Explained (WillItRunAI) · K-Quants and I-Quants Guide · GPTQ vs AWQ vs EXL2 vs llama.cpp · AWQ Paper (MLSys 2024) · Quantization Methods Compared

Model Formats: GGUF vs Safetensors vs Others

The file format determines which tools can load the model. This is one of the most common sources of confusion.

GGUF

  • Created by: Georgi Gerganov (llama.cpp project)
  • Extension: .gguf
  • What it is: A single-file format packaging weights, tokenizer, and metadata. Designed for local inference with extensive quantization support.
  • Runs on: Ollama, LM Studio, llama.cpp, KoboldCpp
  • Pros: Single-file portability, CPU-friendly, quantization from 2-bit to 8-bit
  • Cons: Requires conversion from safetensors, slower than GPU-native formats on NVIDIA

Safetensors

  • Created by: Hugging Face
  • Extension: .safetensors
  • What it is: A secure serialization format — pure data, no executable code. It replaced PyTorch's pickle format, which allowed arbitrary code execution.
  • Runs on: vLLM, HuggingFace Transformers, TGI, SGLang
  • Pros: Secure, fast loading (76x faster than pickle on CPU), the standard for training/fine-tuning
  • Cons: Full-precision models require substantial VRAM

MLX

  • Created by: Apple Machine Learning Research
  • Extension: .safetensors (MLX-converted)
  • What it is: Apple Silicon-native format leveraging unified memory. No data copying between CPU and GPU.
  • Runs on: MLX framework, LM Studio (Mac), Ollama (Mac, since March 2026)
  • Pros: Optimized for Apple Silicon, leverages all system RAM
  • Cons: Apple Silicon only

Others

| Format | Use Case | Note |
| --- | --- | --- |
| ONNX | Cross-platform/mobile/browser deployment | Not commonly used for LLMs |
| TensorRT | Maximum NVIDIA GPU throughput | GPU-architecture-specific, not portable |
| PyTorch .bin | Legacy | Being replaced by safetensors everywhere |

The Key Insight

GGUF is for local inference. If you're using Ollama, LM Studio, or llama.cpp, you need GGUF (or MLX on Mac).

Safetensors is for everything else — GPU inference with vLLM, training, fine-tuning, and as the canonical format on HuggingFace.

You cannot fine-tune from GGUF. If you want to fine-tune, start with the safetensors version, train with LoRA/QLoRA, then convert the result to GGUF for serving.
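As a sketch, that last conversion step is typically done with llama.cpp's tooling. The paths and model names below are illustrative, and script/binary locations can differ across llama.cpp versions:

```
# 1. Get llama.cpp's conversion tooling
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt

# 2. Convert the merged safetensors checkpoint to a full-precision GGUF
python llama.cpp/convert_hf_to_gguf.py ./my-finetuned-model \
  --outfile my-model-f16.gguf --outtype f16

# 3. Quantize for serving (llama-quantize is built with the project)
./llama.cpp/build/bin/llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```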

Further reading: Common AI Model Formats (HuggingFace Blog) · What is GGUF? Complete Guide · Safetensors Security Audit · MLX GitHub · Ollama: Importing Models

Format Compatibility Matrix

Which tools support which formats — at a glance:

| Format | Ollama | LM Studio | vLLM | llama.cpp | ExLlamaV2 | HF Transformers |
| --- | --- | --- | --- | --- | --- | --- |
| GGUF | ✅ | ✅ | — | ✅ | — | — |
| Safetensors | ✅ (auto-converts) | — | ✅ | — | — | ✅ |
| AWQ | — | — | ✅ | — | — | ✅ |
| GPTQ | — | — | ✅ | — | ✅ | ✅ |
| EXL2 | — | — | — | — | ✅ | — |
| MLX | ✅ (Mac) | ✅ (Mac) | — | — | — | — |

Ollama can import safetensors models via a Modelfile and auto-converts them to GGUF. On Apple Silicon, Ollama now uses MLX as its backend (since March 2026).
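The import itself is driven by a Modelfile. A minimal example, assuming a local directory of safetensors weights (the directory and model names here are illustrative):

```
# Modelfile — point FROM at a directory containing safetensors weights
FROM ./Qwen3.5-9B-Instruct
```

Running `ollama create my-qwen -f Modelfile` converts the weights to GGUF and registers the model locally, after which `ollama run my-qwen` works like any library model.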

Architecture: Dense vs Mixture of Experts

You'll see "MoE" in model descriptions and encoded in names like 35B-A3B or 8x7B. This is an architectural choice that fundamentally changes the size-to-performance equation.

Dense Models

Every parameter is used for every token. A 32B dense model activates all 32 billion parameters on every input.

  • Examples: Gemma 4 31B, Qwen3.5-27B, Llama 3.1 70B
  • Naming: Just the parameter count — 32B, 70B
  • RAM required: Proportional to total parameter count

Mixture of Experts (MoE)

The model contains multiple "expert" sub-networks. A router selects only a few experts per token — the rest stay idle.

  • Examples: Qwen3.5-35B-A3B (35B total, 3B active), Llama 4 Scout (109B total, 17B active)
  • Naming: total params followed by "A" plus active params (e.g., 35B-A3B), or described in the model card
  • RAM required: Based on total parameters (all experts must be in memory)
  • Compute cost: Based on active parameters (only selected experts run)
| Model | Total Params | Active Params | Experts | Behavior |
| --- | --- | --- | --- | --- |
| Qwen3.5-35B-A3B | 35B | 3B | MoE | Large-model knowledge, small-model speed |
| Qwen3.5-122B-A10B | 122B | 10B | MoE | Near-frontier quality |
| Qwen3.5-397B-A17B | 397B | 17B | MoE | Frontier-class open model |
| Llama 4 Scout | 109B | 17B | 16 | 10M token context window |
| Llama 4 Maverick | 400B | 17B | 128 | Beats GPT-4o on many benchmarks |
| Gemma 4 26B-A4B | 26B | 4B | MoE | Near-31B quality at 4B compute |
| DeepSeek-V3 | 671B | 37B | MoE | Strong coding + general |
| GLM-5 | 744B | 40B | MoE | MIT licensed, trained on Huawei chips |

The tradeoff: An MoE model gives you the knowledge capacity of a much larger model at a fraction of the compute cost per token. But you still need enough RAM to hold all the parameters — the router needs access to every expert, even if it only activates a few at a time.

Practical example: Qwen3.5-35B-A3B has 35B total parameters (needs ~20GB at Q4_K_M) but runs at the speed of a 3B model. Compare that to a 3B dense model that needs ~2GB but has far less knowledge capacity. The MoE trades memory for intelligence.
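A quick back-of-the-envelope calculation makes the memory-vs-compute split concrete (reusing the rough 0.6 GB-per-billion-parameters figure for Q4_K_M):

```python
def moe_footprint(total_b: float, active_b: float, gb_per_b: float = 0.6) -> dict:
    """RAM scales with total parameters; per-token compute with active ones."""
    return {
        "ram_gb": total_b * gb_per_b,         # every expert must be resident
        "compute_ratio": active_b / total_b,  # fraction of weights used per token
    }

qwen = moe_footprint(35, 3)
# ~21 GB resident, but each token touches under a tenth of the weights,
# so decode speed is close to a 3B dense model
```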

Further reading: A Visual Guide to Mixture of Experts · MoE LLMs: Key Concepts (Neptune.ai) · NVIDIA MoE Blog

Community Fine-Tunes and Variants

Beyond official releases, a vibrant community creates derivative models. These suffixes tell you what was done:

Common Derivative Suffixes

| Suffix | Meaning | Example |
| --- | --- | --- |
| -distilled / -Distill | Smaller model trained to mimic a larger "teacher" model | DeepSeek-R1-Distill-Qwen-32B |
| -abliterated | Safety refusal behavior surgically removed post-training | Llama-3.2-abliterated |
| -uncensored | Trained on unfiltered data to remove guardrails | Dolphin-Mixtral-8x7B |
| -reasoning | Optimized for chain-of-thought reasoning | Phi-4-reasoning |
| -LoRA | Fine-tuned with Low-Rank Adaptation (adapter weights only) | Various community models |

Key Community Contributors

| Name | Role | Known For |
| --- | --- | --- |
| bartowski | GGUF quantizer | Most prolific quantizer on HuggingFace — multiple quant levels for every major release |
| unsloth (Daniel Han) | Fine-tuning framework + quantizer | Dynamic 2.0 quantization with per-layer optimization, 2-5x faster fine-tuning |
| Nous Research (Teknium) | Fine-tuning lab | Hermes series — premium fine-tunes with minimal content filtering |
| Eric Hartford | Fine-tuner | Dolphin uncensored model family |
| TheBloke | GGUF/GPTQ quantizer | Pioneer of community quantization (less active since 2024; bartowski inherited the role) |
| mlx-community | MLX converters | Pre-converted models for Apple Silicon users |

Distillation Explained

Distillation is a technique where a smaller "student" model is trained to replicate a larger "teacher" model's outputs. The most famous example: DeepSeek-R1-Distill-Qwen-32B — a Qwen 2.5 32B model fine-tuned on 800,000 chain-of-thought reasoning samples generated by DeepSeek-R1 (671B). The result outperforms OpenAI o1-mini on multiple benchmarks despite being ~20x smaller.

When you see "-Distill" in a name, it means: this model learned its skills from a bigger model, not just from raw data.

Further reading: Abliteration Explained (HuggingFace Blog) · DeepSeek-R1 Distilled Models · LoRA vs QLoRA (Modal) · Unsloth Dynamic 2.0 GGUFs · bartowski on HuggingFace

The 2026 Model Landscape

The open-weight ecosystem moves fast. Here's where the major families stand as of April 2026.

Gemma 4 (Google) — Apache 2.0

Natively multimodal across all sizes. The 26B MoE achieves near-31B quality with only 4B active parameters.

| Model | Params | Architecture | Context | Modalities |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2.3B | Dense | 128K | Text, Image, Video, Audio |
| Gemma 4 E4B | 4.5B | Dense | 128K | Text, Image, Video, Audio |
| Gemma 4 26B-A4B | 26B total / 4B active | MoE | 256K | Text, Image, Video |
| Gemma 4 31B | 31B | Dense | 256K | Text, Image, Video |

Best for: Multimodal tasks at any size. The E4B is remarkable — audio, video, and image understanding at 4.5B parameters.

Qwen 3.5 (Alibaba) — Apache 2.0

The widest size range of any model family. Features hybrid thinking/non-thinking mode and a new Gated DeltaNet architecture.

| Model | Params | Architecture | Context |
| --- | --- | --- | --- |
| Qwen3.5-0.8B | 0.8B | Dense | 262K |
| Qwen3.5-4B | 4B | Dense | 262K |
| Qwen3.5-9B | 9B | Dense | 262K |
| Qwen3.5-27B | 27B | Dense | 262K |
| Qwen3.5-35B-A3B | 35B / 3B active | MoE | 262K |
| Qwen3.5-122B-A10B | 122B / 10B active | MoE | 262K |
| Qwen3.5-397B-A17B | 397B / 17B active | MoE | 262K |

Best for: Versatility. 201 languages, strong coding (Qwen2.5-Coder), and the 35B-A3B MoE runs on 8GB+ VRAM with Q4_K_M quantization. The most popular base for community fine-tunes.

Llama 4 (Meta) — Llama Community License

Meta's first MoE generation. Scout's 10M token context window is industry-leading.

| Model | Params | Architecture | Context |
| --- | --- | --- | --- |
| Llama 4 Scout | 109B / 17B active | MoE (16 experts) | 10M |
| Llama 4 Maverick | 400B / 17B active | MoE (128 experts) | 1M |
| Llama 4 Behemoth | ~2T / 288B active | MoE (16 experts) | TBD (preview) |

Best for: Long context use cases. Scout fits on a single H100 GPU with a 10-million-token window.

Other Notable Families

| Family | Key Model | Params | Standout Feature |
| --- | --- | --- | --- |
| DeepSeek | R1-Distill-Qwen-32B | 32B | Best local reasoning via distillation |
| Phi-4 (Microsoft) | Phi-4-reasoning | 14B | Beats 671B models on math benchmarks |
| GLM-5 (Zhipu AI) | GLM-5 | 744B / 40B active | MIT license, trained without NVIDIA chips |
| Mistral | Mistral Large 3 | 675B / 41B active | Apache 2.0, strong multilingual |
| Hermes 4 (Nous) | Hermes 4 405B | 405B | Minimal content filtering, strong reasoning |
| MiniMax | M2 | 229B / 10B active | $0.26/M input — cheapest frontier-class API |

MoE everywhere. Almost every major release uses Mixture of Experts. The pattern: massive total parameters for knowledge, small active parameters for speed.

Hybrid reasoning. Models like Qwen 3.5 can toggle between fast responses and deep chain-of-thought reasoning in a single model. No separate "thinking" variant needed.

Distillation economy. DeepSeek-R1 proved you can get 80%+ of frontier reasoning in a 7-32B model. Everyone is distilling now.

Context windows keep growing. Llama 4 Scout: 10M tokens. Qwen 3.5: 262K native. Gemma 4: 256K.

The landscape changes quickly — check LMSYS Chatbot Arena for current rankings, and browse OpenRouter or the Ollama library to see what the community is actually using.

Further reading: Gemma 4 Announcement (Google Blog) · Qwen 3.5 on GitHub · Llama 4 Models (Meta) · DeepSeek Complete Guide (BentoML) · GLM-5 Guide · Hermes 4 (Nous Research)

How to Read a Hugging Face Model Card

Hugging Face is where most models live. Here's what to look for on a model page.

Repository Name

Format: organization/model-name

  • google/gemma-4-4b-it → Official Google release, Gemma 4, 4B params, instruction-tuned
  • bartowski/Qwen3.5-27B-GGUF → Community GGUF quantization by bartowski
  • unsloth/DeepSeek-R1-Distill-Llama-8B → Unsloth's optimized version

Key Files

| File | What It Is |
| --- | --- |
| README.md | Model card — architecture, benchmarks, usage, license |
| config.json | Architecture blueprint (layers, vocab size, attention heads) |
| model.safetensors | The actual weights (may be sharded: model-00001-of-00003.safetensors) |
| tokenizer.json | Tokenizer definition |
| generation_config.json | Default generation settings (temperature, top_p) |
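For example, the headline architecture facts can be pulled straight out of config.json. The keys below follow the usual HF Transformers conventions, but exact field names vary by architecture, and the values here are made up for illustration:

```python
import json

# A trimmed-down example of what config.json typically contains.
# Values are illustrative, not taken from a real release.
sample_config = """{
  "architectures": ["Qwen2ForCausalLM"],
  "hidden_size": 5120,
  "num_hidden_layers": 64,
  "num_attention_heads": 40,
  "vocab_size": 152064,
  "max_position_embeddings": 131072
}"""

cfg = json.loads(sample_config)
print(f"{cfg['architectures'][0]}: {cfg['num_hidden_layers']} layers, "
      f"context up to {cfg['max_position_embeddings']:,} tokens")
```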

What to Check Before Downloading

  1. License — Apache 2.0 is most permissive. Llama Community License has commercial restrictions above 700M users. Some models restrict commercial use entirely.
  2. Parameter count and architecture — Dense or MoE? How many active parameters?
  3. Context length — How much text can the model process at once?
  4. Quantization available — Check if bartowski or unsloth have GGUF versions in separate repos.
  5. Benchmark scores — Compare against similar-sized models for your use case (MMLU for general knowledge, HumanEval for coding, GSM8K for math).

Finding the Right Variant

If the official repo is google/gemma-4-31b-it (safetensors, full precision), you'll find quantized versions at:

  • bartowski/gemma-4-31B-it-GGUF — Standard GGUF quantizations
  • unsloth/gemma-4-31B-it-GGUF — Dynamic quantization variants
  • mlx-community/gemma-4-31B-it-MLX — Apple Silicon format

Decision Framework: Finding the Right Model

There's no single "best model" for a given hardware setup — it depends on your task, your quality expectations, and how the model was trained, not just parameter count. The landscape changes quickly and new models regularly reshuffle the rankings. Rather than prescribing specific models, here's a framework for how to research and evaluate your options.

Step 1: Know Your Hardware Limits

Your RAM determines the maximum model size you can load. This table shows approximate upper bounds at Q4_K_M quantization:

| Your Setup | Approximate Max Size (Q4_K_M) | Where to Explore |
| --- | --- | --- |
| 8GB RAM | ~7B dense, or small MoE | Ollama library — filter by size |
| 16GB RAM / Mac | ~14B dense | LM Studio Discover — browse by hardware compatibility |
| 32GB Mac | ~32B dense | HuggingFace Models — check model cards for RAM requirements |
| 64GB+ Mac | 70B+ dense, large MoE | OpenRouter — try models via API before downloading |
| NVIDIA 8-12GB VRAM | ~9B dense | Ollama library or vLLM with AWQ |
| NVIDIA 24GB VRAM | ~27B dense | Community benchmarks at LocalLLM.in |

These are rough guidelines — actual requirements depend on context length, batch size, and the specific model architecture. MoE models need RAM for their full parameter count even though they only activate a fraction per token.

Step 2: Explore What the Community Is Using

The best way to find the right model is to see what others with similar hardware and use cases are running. Here are the best places to research:

  • Ollama Model Library — Browse popular models, see download counts, and try them with one command. The tags show available sizes and quantizations.
  • Hugging Face Trending Models — See what's new and popular. Read model cards for benchmarks, hardware requirements, and community feedback.
  • OpenRouter — Try models via API before committing to a local download. Great for comparing quality across families before choosing one to run locally.
  • LM Studio — Visual model browser that shows hardware compatibility. Good for beginners exploring what fits their system.
  • LMSYS Chatbot Arena — Community-voted rankings across hundreds of models. Useful for comparing quality across model families.
  • LocalLLM.in — Benchmarks specifically for local inference, organized by VRAM tier.

As of April 2026, some of the most popular open-weight model families include Qwen 3.5, Gemma 4, DeepSeek (V3 and R1 distills), GLM-5, MiniMax M2, Kimi K2.5, and Phi-4 — but this list shifts regularly as new models release. Don't take any single recommendation as definitive. Try a few models yourself and evaluate quality for your specific tasks.

Step 3: Which Quantization?

The ladder, from minimum to maximum quality:

  1. You're very memory-constrained → Q3_K_M (noticeable quality loss, but it runs)
  2. Standard recommendation → Q4_K_M (92% quality, fits most setups)
  3. You have extra RAM → Q5_K_M (near-imperceptible loss)
  4. You have plenty of RAM → Q6_K or Q8_0 (effectively lossless)

General rule: prefer a larger model at lower quantization over a smaller model at higher quantization. A 14B at Q4_K_M almost always beats a 7B at Q8_0.
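Both the ladder and the size rule can be combined into a small helper. The bits-per-weight figures below are approximate llama.cpp averages, and the 1.2x factor is a rough allowance for KV cache and runtime overhead, so treat this as a sketch rather than a sizing tool:

```python
# Approximate average bits per weight for common GGUF levels
QUANT_BPW = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def best_quant(params_b: float, ram_gb: float, overhead: float = 1.2):
    """Pick the highest-quality quant whose file (plus overhead) fits in RAM.

    Returns None when not even the lowest level on the ladder fits.
    """
    fitting = [
        (bpw, name) for name, bpw in QUANT_BPW.items()
        if params_b * bpw / 8 * overhead <= ram_gb
    ]
    return max(fitting)[1] if fitting else None

best_quant(7, 16)   # a 7B model in 16 GB leaves room for Q8_0
best_quant(32, 24)  # a 32B model in 24 GB drops to Q4_K_M
```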

Step 4: Which Format?

| Your Tool | Format to Download |
| --- | --- |
| Ollama | GGUF (or let Ollama auto-convert) |
| LM Studio | GGUF or MLX (Mac) |
| llama.cpp | GGUF |
| vLLM | Safetensors (or AWQ for GPU quantization) |
| Fine-tuning | Safetensors (always start with full precision) |
| Apple Silicon native | MLX |

Quick-Start: Trying Models with Ollama

The fastest way to experiment is with Ollama — one command to download and run. Here are some examples to get started, but browse the full Ollama library to see what's currently popular:

# Browse available models at https://ollama.com/library, then:

# Try a small model (fits 8GB+ RAM)
ollama run gemma4:4b

# Try a medium model (fits 16GB+ RAM)
ollama run qwen3.5:9b

# Try a larger model (fits 32GB+ RAM)
ollama run qwen3.5:27b

# Specify a quantization level
ollama run qwen3.5:9b-q5_K_M

# See what Ollama downloaded
ollama list

The Ollama library, LM Studio's model browser, and OpenRouter's model list are all good starting points for discovering what's available. Try a few models at your hardware tier, compare the output quality for your specific use case, and see what works best for you.

Glossary

Quick reference for every abbreviation you'll encounter in model names.

| Term | Meaning |
| --- | --- |
| B | Billions of parameters |
| M | Millions of parameters |
| IT / Instruct | Instruction-tuned — fine-tuned to follow prompts |
| Base | Pretrained only — raw text completion |
| Chat | Optimized for multi-turn conversation |
| GGUF | GPT-Generated Unified Format — single-file format for local inference |
| Safetensors | HuggingFace's secure tensor serialization |
| Q4_K_M | 4-bit K-quant, medium variant — the mainstream default |
| Q8_0 | 8-bit quantization — near-lossless |
| F16 / FP16 | 16-bit floating point — half precision |
| BF16 | Brain Float 16 — default training precision |
| AWQ | Activation-Aware Weight Quantization — GPU-optimized 4-bit |
| GPTQ | GPT Quantization — early GPU quantization method |
| EXL2 | ExLlamaV2 format — mixed bit-width GPU quantization |
| MLX | Apple's ML framework for Apple Silicon |
| MoE | Mixture of Experts — only a fraction of params active per token |
| Dense | All parameters active on every token |
| LoRA | Low-Rank Adaptation — efficient fine-tuning method |
| QLoRA | Quantized LoRA — fine-tuning with 4-bit base model |
| DPO | Direct Preference Optimization — alignment technique |
| RLHF | Reinforcement Learning from Human Feedback |
| Distilled | Trained to mimic a larger model's outputs |
| Abliterated | Safety refusals surgically removed |
| VL | Vision-Language — supports image input |
| A_B suffix | Active parameters in MoE (e.g., A4B = 4B active) |
| imatrix | Importance matrix — used during quantization for better quality |
| K-quant | Mixed-precision quantization with importance-based bit allocation |
| bpw | Bits per weight — average precision across the model |

This guide is part of a series on local AI inference. For tool comparisons and hardware recommendations, see Local LLM Inference in 2026: The Complete Guide. For Apple Silicon-specific advice, see Best Mac Mini for Local LLMs.