What does "B" mean in LLM model names like "8B" or "70B"?

"B" stands for billions of parameters — the trainable weights in a neural network. A 7B model has 7 billion parameters. Larger models generally have more knowledge capacity but require more memory. At Q4 quantization: 7B needs ~4GB RAM, 13B needs ~8GB, 30B needs ~18GB, and 70B needs ~40GB.

What is the difference between Q4_K_M and Q8_0 quantization?

Q4_K_M uses 4-bit precision with the K-quant method in its medium (M) mix, which keeps some tensors at higher precision. It is the mainstream default, retaining ~92% of quality while cutting file size by roughly 75% compared to FP16. Q8_0 uses 8-bit precision and is near-lossless, but the file is roughly twice the size of Q4_K_M. For most users, Q4_K_M is the sweet spot. Use Q8_0 only if you have plenty of RAM and want maximum quality.

What does "instruct" or "IT" mean in an LLM model name?

IT stands for Instruction Tuned. These models have been fine-tuned on instruction-response pairs so they follow user prompts reliably. A base model is trained on raw text prediction and may not follow instructions well. Always use the instruct/IT variant for chat, coding, and general use — base models are primarily for researchers and fine-tuners.

What is the difference between GGUF and safetensors format?

GGUF is the format for local inference tools like Ollama, LM Studio, and llama.cpp — it packages weights, tokenizer, and metadata into a single portable file optimized for CPU and mixed CPU/GPU inference. Safetensors is HuggingFace's format for GPU inference with vLLM, training, and fine-tuning. If you're running models locally, you want GGUF. If you're deploying on NVIDIA GPUs or fine-tuning, you want safetensors.

How do I choose the right quantization for my hardware?

Q4_K_M is the most common starting point: it retains ~92% of quality while cutting file size by roughly 75% compared to FP16. If you have extra RAM, try Q5_K_M or Q6_K for better quality. If memory is tight, Q3_K_M trades some quality for a smaller footprint. The general principle: pick the largest model that fits your hardware, then the highest-precision quantization level you can afford. Tools like Ollama and LM Studio make it easy to experiment with different quantization levels.

What does MoE (Mixture of Experts) mean in a model name?

MoE stands for Mixture of Experts — an architecture where only a fraction of the model's total parameters are active per token. For example, Qwen3.5-35B-A3B has 35B total parameters but only activates 3B per token, giving large-model quality at small-model inference speed. The tradeoff: you still need enough RAM for all 35B parameters, but compute cost is based on the 3B active.

What is the difference between a base model and an instruction-tuned model?

A base model is trained on raw text via next-token prediction — it completes text patterns but doesn't follow instructions reliably. An instruction-tuned (IT/instruct) model is further trained on instruction-response pairs to follow user prompts. A chat model adds multi-turn conversation optimization. For general use, always pick instruct or chat variants.

How do I read a Hugging Face model card?

Start with the repository name, which follows the format organization/model-name (e.g., google/gemma-4-4b-it). Then check the license (Apache 2.0 is among the most permissive), parameter count, context length, supported languages, and benchmark scores. The Files tab shows the model weights: safetensors files for the original release. For GGUF variants, look in separate repos from quantizers like bartowski or unsloth.

How many parameters do I need for coding, chat, or general use?

It depends on your hardware and quality expectations — there's no universal answer. Generally, 7-9B models handle basic chat well, 14B+ models are more capable for coding and reasoning, and 27B+ models produce more nuanced output. But a well-trained smaller model often outperforms a larger one. The best approach is to browse the Ollama library, OpenRouter, or Hugging Face trending models and try a few at your hardware tier.

What is GGUF format and why do local LLM tools use it?

GGUF (GPT-Generated Unified Format) is a single-file format created by the llama.cpp project. It packages model weights, tokenizer, and metadata together for fast loading via memory-mapping. It supports extensive quantization (2-bit to 8-bit) and runs on CPU, GPU, or mixed CPU+GPU. Ollama, LM Studio, KoboldCpp, and llama.cpp all use GGUF as their primary format.

LLM Model Names Decoded: A Developer's Guide to Parameters, Quantization & Formats

TL;DR: "B" = billions of parameters. "IT" = instruction tuned. "Q4_K_M" = 4-bit quantization, a common default. "GGUF" = the format for Ollama and local tools. "MoE" = only a fraction of parameters activate per token. This guide decodes every component of LLM model names, explains quantization formats and file types, and points you to the best resources for researching which model fits your hardware and use case.

If you've ever stared at a Hugging Face model page and seen something like unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF and wondered what any of that means — this guide is for you.

The open-weight model ecosystem has exploded. Gemma 4, Qwen 3.5, Llama 4, DeepSeek, Mistral — every family ships dozens of variants across different sizes, architectures, quantization levels, and file formats. Picking the right one for your hardware and use case shouldn't require a PhD.

I wrote this as a companion to my local LLM inference tools guide, which covers how to run models. This guide explains what all those cryptic suffixes mean and points you toward the best resources for researching which model fits your setup.

Anatomy of a Model Name
Parameters: What the Numbers Mean
Training Variants: Base vs Instruct vs Chat
Quantization Demystified
Model Formats: GGUF vs Safetensors vs Others
Format Compatibility Matrix
Architecture: Dense vs Mixture of Experts
Community Fine-Tunes and Variants
The 2026 Model Landscape
How to Read a Hugging Face Model Card
Decision Framework: Finding the Right Model
Glossary

Anatomy of a Model Name

Let's decode a representative model name, piece by piece.

Take bartowski/Qwen3.5-32B-Instruct-GGUF-Q4_K_M:

Component	Value	Meaning
Organization	`bartowski`	Who published this variant (community quantizer)
Family	`Qwen3.5`	Model family and version (Alibaba's Qwen, generation 3.5)
Size	`32B`	32 billion parameters
Training	`Instruct`	Instruction-tuned (follows prompts)
Format	`GGUF`	File format (for Ollama, LM Studio, llama.cpp)
Quantization	`Q4_K_M`	4-bit precision, K-quant method, medium variant (some tensors kept at higher precision)

Here's another: google/gemma-4-26B-A4B-it

Component	Value	Meaning
Organization	`google`	Official release from Google
Family	`gemma-4`	Gemma generation 4
Size	`26B-A4B`	26B total params, 4B active (Mixture of Experts)
Training	`it`	Instruction tuned

The general pattern: [Org/] Family-Version-Size [-Active] -Training [-Format] [-Quantization]

Not every model follows this exactly — naming is more convention than standard. But once you know the components, you can decode anything.
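The pattern above can be sketched as a small parser. This is only an illustration: `parse_model_name` and its regular expressions are hypothetical helpers written for this guide, not an official naming grammar, and names that stray from the convention will be decoded imperfectly.

```python
import re

# Illustrative patterns only; real model names are convention, not a standard.
SIZE_RE = re.compile(r"^(\d+(?:\.\d+)?)([BM])$", re.IGNORECASE)   # 32B, 278M
ACTIVE_RE = re.compile(r"^A(\d+(?:\.\d+)?)B$", re.IGNORECASE)     # A4B
QUANT_RE = re.compile(r"^(I?Q\d\w*|F16|BF16|FP16)$", re.IGNORECASE)
TRAINING = {"instruct", "it", "chat", "base"}
FORMATS = {"gguf", "mlx", "awq", "gptq", "exl2"}


def parse_model_name(repo_id: str) -> dict:
    """Split 'org/Family-Size-Training-Format-Quant' into labeled parts."""
    org, _, name = repo_id.rpartition("/")
    parts = {"org": org or None, "family": [], "size": None, "active": None,
             "training": None, "format": None, "quant": None}
    for token in name.split("-"):
        low = token.lower()
        if ACTIVE_RE.match(token):
            parts["active"] = token
        elif SIZE_RE.match(token) and parts["size"] is None:
            parts["size"] = token
        elif low in TRAINING:
            parts["training"] = token
        elif low in FORMATS:
            parts["format"] = token
        elif QUANT_RE.match(token):
            parts["quant"] = token
        else:
            # Anything unrecognized is treated as part of the family/version.
            parts["family"].append(token)
    parts["family"] = "-".join(parts["family"])
    return parts


print(parse_model_name("bartowski/Qwen3.5-32B-Instruct-GGUF-Q4_K_M"))
print(parse_model_name("google/gemma-4-26B-A4B-it"))
```

Unrecognized tokens fall through to the family field on purpose, so an unusual suffix degrades the result gracefully instead of raising an error.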

Parameters: What the Numbers Mean

The "B" in model names stands for billions of parameters — the trainable numerical weights that a neural network learns during training. More parameters generally means more knowledge capacity, but also more memory required.

Size Tiers

Tier	Parameter Range	RAM Needed (Q4_K_M)	Best For
Tiny	1-3B	2-3 GB	Edge devices, quick tasks, mobile
Small	4-9B	3-6 GB	General chat, summarization, simple coding
Medium	13-14B	8-10 GB	Strong coding, reasoning, creative writing
Large	27-32B	18-22 GB	Complex reasoning, nuanced writing
Extra Large	70B+	40+ GB	Near-frontier quality, research

The rule of thumb for Q4_K_M GGUF: take the parameter count in billions, multiply by roughly 0.6, and that's your approximate file size in GB. A 7B model is ~4GB, a 32B is ~19GB, a 70B is ~40GB.
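As a quick sketch of that rule of thumb, the snippet below turns a size label into an approximate Q4_K_M file size. The function names are made up for illustration, and the 0.6 GB-per-billion factor is the rough figure from this guide, not an exact constant.

```python
def params_in_billions(label: str) -> float:
    """Convert a size label such as '7B' or '278M' to billions of parameters."""
    value, unit = float(label[:-1]), label[-1].upper()
    if unit == "B":
        return value
    if unit == "M":
        return value / 1000
    raise ValueError(f"unrecognized size label: {label!r}")


def estimate_q4_k_m_size_gb(label: str, gb_per_billion: float = 0.6) -> float:
    """Approximate Q4_K_M GGUF file size in GB (rule of thumb, not exact)."""
    return params_in_billions(label) * gb_per_billion


for label in ("7B", "32B", "70B"):
    print(f"{label}: ~{estimate_q4_k_m_size_gb(label):.1f} GB")
```

Treat the result as a lower bound on memory: the context window and runtime overhead come on top of the file size.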

You'll also see "M" for millions — 278M means 278 million parameters. These are tiny models for embedding, classification, or on-device use.

Bigger Isn't Always Better

A well-trained 14B model frequently outperforms a mediocre 70B. Training data quality, architecture choices, and fine-tuning matter as much as raw parameter count. Phi-4-reasoning at 14B beats DeepSeek-R1 (671B total) on some math benchmarks. Qwen2.5-Coder at 14B scores ~85% on HumanEval, competitive with models 5x its size.

The best way to evaluate this is hands-on experimentation. Browse the Ollama model library, check Hugging Face trending models, or explore what's popular on OpenRouter — then try a few models at your hardware tier and see what works for your workflow.

Further reading: AI Model Parameters Explained · LLM Model Sizes Guide · Phi-4 Reasoning Technical Report

Training Variants: Base vs Instruct vs Chat

When you see -base, -instruct, -it, or -chat in a model name, it tells you how the model was fine-tuned after initial pretraining.

Base (Pretrained)

Trained on massive text corpora via next-token prediction
Completes text patterns but doesn't follow instructions reliably
Like a student who's read every book but hasn't learned to answer exam questions
When to use: Fine-tuning your own model, research, text completion

Instruct / IT (Instruction Tuned)

Fine-tuned on instruction-response pairs (supervised fine-tuning)
Follows user prompts reliably: "Summarize this," "Write a function that..."
The standard variant for most use cases
When to use: Coding, Q&A, summarization, analysis — virtually everything

Chat

Further optimized for multi-turn conversations with RLHF or DPO
Better at maintaining context across a conversation
When to use: Chatbot applications, interactive assistants

Other Training Suffixes

Suffix	Meaning
`-DPO`	Trained with Direct Preference Optimization (alignment technique)
`-RLHF`	Trained with Reinforcement Learning from Human Feedback
`-reasoning` / `-thinking`	Optimized for chain-of-thought reasoning
`-vision` / `-VL`	Supports image input (vision-language)
`-coder`	Fine-tuned specifically for code generation

For general use, always pick the instruct/IT variant. Base models are for researchers and fine-tuners. If you're running a model in Ollama or LM Studio, you want instruct.

Further reading: Base vs Instruct vs Chat Models (Medium) · Foundation vs Instruct vs Thinking Models · Choosing the Right Model (BentoML)

Quantization Demystified

Quantization reduces the numerical precision of model weights — storing each weight in fewer bits. This shrinks file size and speeds up inference at the cost of some accuracy.

Precision Formats

Full-precision models store each weight as a 16-bit or 32-bit floating point number. Quantization compresses these down:

Format	Bits per Weight	Description	Typical Use
FP32	32	Full precision, gold standard	Training reference
BF16	16	Brain Float 16 (same range as FP32, lower precision)	Default for LLM training
FP16	16	Half precision (narrower range than BF16)	GPU inference
FP8	8	8-bit float	Cutting-edge training/inference
INT8	8	8-bit integer, fixed-point	Post-training quantization
INT4 / FP4	4	4-bit, aggressive compression	Local inference on constrained hardware

When you see BF16 or FP16 in a model name, it means the weights are stored at that precision — no quantization applied. These are the highest-quality downloads but also the largest files.
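To see why precision matters so much for download size, here is a minimal sketch that multiplies parameter count by bits per weight. It assumes every weight is stored at the nominal bit-width from the table and ignores metadata and per-block scale overhead; `weights_size_gb` is a hypothetical helper name.

```python
# Nominal bits per weight, taken from the table above.
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weights_size_gb(params_billions: float, fmt: str) -> float:
    """Raw weight storage in GB: parameters x bits / 8 bits per byte."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8


for fmt in BITS_PER_WEIGHT:
    print(f"7B at {fmt}: {weights_size_gb(7, fmt):.1f} GB")
```

At FP16 a 7B model comes out at 14 GB, which matches the F16 baseline used in the GGUF table later in this section.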

GGUF Quantization Levels

GGUF files use a naming scheme: Q [bits] _ [method] _ [size] — for example, Q4_K_M.

Q = quantized
Number = bits per weight (2, 3, 4, 5, 6, 8)
K = K-quant method (smarter bit allocation across layers)
S / M / L = Small / Medium / Large variant (how many tensors are kept at a higher bit-width)

Level	Bits	Size (7B model)	Quality	Recommendation
Q2_K	2	~2.7 GB	Poor — significant loss	Emergency only
Q3_K_S	3	~2.9 GB	Fair — noticeable degradation	Very constrained hardware
Q3_K_M	3	~3.1 GB	Fair	Tight budgets
Q4_K_S	4	~3.6 GB	Good	Budget hardware
Q4_K_M	4	~3.8 GB	Good — 92% quality retention	The mainstream default
Q5_K_S	5	~4.6 GB	Very good	Between Q4 and Q6
Q5_K_M	5	~4.8 GB	Very good — near-imperceptible loss	When you have extra RAM
Q6_K	6	~5.5 GB	Excellent	Quality-sensitive tasks
Q8_0	8	~7 GB	Near-lossless	When VRAM isn't a concern
F16	16	~14 GB	Perfect	Maximum quality baseline

The sweet spot for most users is Q4_K_M. It's the default quantization for most models in the Ollama library, retains ~92% of the original model's quality, and cuts file size by roughly 75% compared to FP16 (about 3.8 GB versus 14 GB for a 7B model).
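The size savings can be checked directly from the table. The sketch below uses the approximate 7B file sizes listed above (they are this guide's rounded figures, not measurements) to compute the reduction relative to F16 and the implied average bits per weight.

```python
# Approximate 7B file sizes in GB, copied from the table above.
SIZES_7B_GB = {
    "Q2_K": 2.7, "Q3_K_S": 2.9, "Q3_K_M": 3.1, "Q4_K_S": 3.6, "Q4_K_M": 3.8,
    "Q5_K_S": 4.6, "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.0, "F16": 14.0,
}


def reduction_vs_f16(level: str) -> float:
    """Fraction of the F16 file size saved by a quantization level."""
    return 1 - SIZES_7B_GB[level] / SIZES_7B_GB["F16"]


def implied_bits_per_weight(level: str) -> float:
    """Average bits per weight implied by the file size (F16 = 16 bits)."""
    return 16 * SIZES_7B_GB[level] / SIZES_7B_GB["F16"]


for level in SIZES_7B_GB:
    print(f"{level:7s} saves {reduction_vs_f16(level):5.1%}, "
          f"~{implied_bits_per_weight(level):.1f} bits/weight")
```

The implied bits per weight usually land a little above the nominal number in the name, because scale factors and the higher-precision tensors add overhead.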

What K-Quant Actually Does

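K-quants use a two-level quantization scheme. Weights are grouped into small blocks (32 weights for Q4_K and Q5_K, 16 weights for the other K-quant types), and the blocks are packed into 256-weight "super-blocks." Each block gets its own scale factor, and those scales are themselves quantized against a single higher-precision scale per super-block (double quantization). This preserves more information than naive bit reduction.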
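The two-level idea can be illustrated with a deliberately simplified sketch. This is not llama.cpp's actual bit layout: it assumes symmetric 4-bit values, 32-weight blocks, and 6-bit block scales, and it omits the per-block minimums that the real Q4_K type also stores. The function names are hypothetical.

```python
import random

BLOCK = 32          # weights per block
SUPER_BLOCK = 256   # weights per super-block (8 blocks)


def quantize_super_block(weights):
    """Quantize 256 floats to 4-bit ints with 6-bit per-block scales."""
    assert len(weights) == SUPER_BLOCK
    blocks = [weights[i:i + BLOCK] for i in range(0, SUPER_BLOCK, BLOCK)]
    # Level 1: one scale per block so its values map into [-8, 7].
    scales = [(max(abs(w) for w in b) / 7) or 1e-12 for b in blocks]
    # Level 2: quantize the scales themselves to integers in 1..63,
    # relative to one float scale stored once per super-block.
    super_scale = max(scales) / 63
    q_scales = [max(1, round(s / super_scale)) for s in scales]
    q_weights = []
    for block, qs in zip(blocks, q_scales):
        step = qs * super_scale  # block scale as it will be reconstructed
        q_weights.append([max(-8, min(7, round(w / step))) for w in block])
    return super_scale, q_scales, q_weights


def dequantize_super_block(super_scale, q_scales, q_weights):
    """Rebuild approximate floats from the two-level representation."""
    out = []
    for qs, q_block in zip(q_scales, q_weights):
        out.extend(q * qs * super_scale for q in q_block)
    return out


random.seed(0)
original = [random.uniform(-1, 1) for _ in range(SUPER_BLOCK)]
packed = quantize_super_block(original)
restored = dequantize_super_block(*packed)
max_err = max(abs(a - b) for a, b in zip(original, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

In this toy layout a super-block costs 256 x 4 bits for the weights, 8 x 6 bits for the block scales, and 16 bits for one half-precision super-scale, or 4.25 bits per weight on average. Storing block scales as small integers instead of full floats is what keeps that overhead low.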

The S/M/L suffix controls which layers get extra precision:

S (Small): All tensors at the base bit-width — smallest file
M (Medium): Some attention and feed-forward tensors get higher bit-width — better quality, slightly larger
L (Large): More tensors at higher bit-width — best quality, largest file

For example, Q4_K_M stores most tensors at 4-bit (Q4_K) but promotes half of the attention value (attention.wv) and feed-forward down-projection (feed_forward.w2) tensors to 6-bit (Q6_K).

I-Quants (Importance Matrix)

A newer family of quantization (IQ2_M, IQ3_M, IQ4_XS) uses importance matrices to identify and protect critical weights during quantization. IQ4_XS can compress more aggressively than Q4_K_M with comparable quality. You'll see these from quantizers like unsloth.

GPU-Native Quantization Methods

GGUF isn't the only game in town. If you have an NVIDIA GPU, these formats run faster:

Format	Creator	Key Advantage	Hardware
AWQ	MIT / NVIDIA	Activation-aware, ~95% quality at 4-bit, fastest with Marlin kernel	NVIDIA GPU only
GPTQ	Frantar et al.	One of the first practical post-training quantization methods for LLMs, wide tool support	NVIDIA GPU only
EXL2	turboderp	Per-layer mixed bit-widths (2-8 bit), fastest interactive inference	NVIDIA GPU only

These methods produce files stored as safetensors (not GGUF) and run through tools like vLLM, ExLlamaV2, or HuggingFace Transformers. They're GPU-only — no CPU fallback.

When to use what:

On CPU or mixed CPU/GPU → GGUF (Q4_K_M default)
On NVIDIA GPU, maximum throughput → AWQ with Marlin kernel
On NVIDIA GPU, maximum quality-per-byte → EXL2

Further reading: GGUF Quantization Explained (WillItRunAI) · K-Quants and I-Quants Guide · GPTQ vs AWQ vs EXL2 vs llama.cpp · AWQ Paper (MLSys 2024) · Quantization Methods Compared

Model Formats: GGUF vs Safetensors vs Others

The file format determines which tools can load the model. This is one of the most common sources of confusion.

GGUF

Created by: Georgi Gerganov (llama.cpp project)
Extension: .gguf
What it is: A single-file format packaging weights, tokenizer, and metadata. Designed for local inference with extensive quantization support.
Runs on: Ollama, LM Studio, llama.cpp, KoboldCpp
Pros: Single-file portability, CPU-friendly, quantization from 2-bit to 8-bit
Cons: Requires conversion from safetensors, slower than GPU-native formats on NVIDIA

Safetensors

Created by: Hugging Face
Extension: .safetensors
What it is: A secure serialization format — pure data, no executable code. Replaced PyTorch's pickle format which had arbitrary code execution vulnerabilities.
Runs on: vLLM, HuggingFace Transformers, TGI, SGLang
Pros: Secure, fast loading (about 76x faster than pickle on CPU in Hugging Face's own benchmark), the standard for training/fine-tuning
Cons: Full-precision models require substantial VRAM

MLX

Created by: Apple Machine Learning Research
Extension: .safetensors (MLX-converted)
What it is: Apple Silicon-native format leveraging unified memory. No data copying between CPU and GPU.
Runs on: MLX framework, LM Studio (Mac), Ollama (Mac, since March 2026)
Pros: Optimized for Apple Silicon, leverages all system RAM
Cons: Apple Silicon only

Others

Format	Use Case	Note
ONNX	Cross-platform/mobile/browser deployment	Not commonly used for LLMs
TensorRT	Maximum NVIDIA GPU throughput	GPU-architecture-specific, not portable
PyTorch .bin	Legacy	Being replaced by safetensors everywhere

The Key Insight

GGUF is for local inference. If you're using Ollama, LM Studio, or llama.cpp, you need GGUF (or MLX on Mac).

Safetensors is for everything else — GPU inference with vLLM, training, fine-tuning, and as the canonical format on HuggingFace.

You cannot fine-tune from GGUF. If you want to fine-tune, start with the safetensors version, train with LoRA/QLoRA, then convert the result to GGUF for serving.

Further reading: Common AI Model Formats (HuggingFace Blog) · What is GGUF? Complete Guide · Safetensors Security Audit · MLX GitHub · Ollama: Importing Models

Format Compatibility Matrix

Which tools support which formats — at a glance:

Format	Ollama	LM Studio	vLLM	llama.cpp	ExLlamaV2	HF Transformers
GGUF	✅	✅	—	✅	—	—
Safetensors	✅ (auto-converts)	✅	✅	—	—	✅
AWQ	—	—	✅	—	—	✅
GPTQ	—	—	✅	—	✅	✅
EXL2	—	—	—	—	✅	—
MLX	✅ (Mac)	✅ (Mac)	—	—	—	—

Ollama can import safetensors models via a Modelfile and auto-converts them to GGUF. On Apple Silicon, Ollama now uses MLX as its backend (since March 2026).

Architecture: Dense vs Mixture of Experts

You'll see "MoE" in model descriptions and encoded in names like 35B-A3B or 8x7B. This is an architectural choice that fundamentally changes the size-to-performance equation.

Dense Models

Every parameter is used for every token. A 32B dense model activates all 32 billion parameters on every input.

Examples: Gemma 4 31B, Qwen3.5-27B, Llama 3.1 70B
Naming: Just the parameter count — 32B, 70B
RAM required: Proportional to total parameter count

Mixture of Experts (MoE)

The model contains multiple "expert" sub-networks. A router selects only a few experts per token — the rest stay idle.

Examples: Qwen3.5-35B-A3B (35B total, 3B active), Llama 4 Scout (109B total, 17B active)
Naming: Total-B-A-Active-B format (e.g., 35B-A3B) or described in model card
RAM required: Based on total parameters (all experts must be in memory)
Compute cost: Based on active parameters (only selected experts run)

Model	Total Params	Active Params	Architecture	Notes
Qwen3.5-35B-A3B	35B	3B	MoE	Large-model knowledge, small-model speed
Qwen3.5-122B-A10B	122B	10B	MoE	Near-frontier quality
Qwen3.5-397B-A17B	397B	17B	MoE	Frontier-class open model
Llama 4 Scout	109B	17B	MoE (16 experts)	10M token context window
Llama 4 Maverick	400B	17B	MoE (128 experts)	Reported by Meta to beat GPT-4o on many benchmarks
Gemma 4 26B-A4B	26B	4B	MoE	Near-31B quality at 4B compute
DeepSeek-V3	671B	37B	MoE	Strong coding + general
GLM-5	744B	40B	MoE	MIT licensed, trained on Huawei chips

The tradeoff: An MoE model gives you the knowledge capacity of a much larger model at a fraction of the compute cost per token. But you still need enough RAM to hold all the parameters — the router needs access to every expert, even if it only activates a few at a time.

Practical example: Qwen3.5-35B-A3B has 35B total parameters (needs ~20GB at Q4_K_M) but runs at the speed of a 3B model. Compare that to a 3B dense model that needs ~2GB but has far less knowledge capacity. The MoE trades memory for intelligence.
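The memory-versus-compute split can be written down as a short calculation. This sketch reuses the rough 0.6 GB-per-billion Q4_K_M factor from earlier in this guide; `moe_footprint` is a hypothetical helper, and the "relative compute" figure is simply active parameters divided by total, not a measured speedup.

```python
def moe_footprint(total_b: float, active_b: float, gb_per_billion: float = 0.6) -> dict:
    """Memory scales with total parameters; per-token compute with active ones."""
    return {
        "memory_gb": total_b * gb_per_billion,   # every expert must be loaded
        "active_fraction": active_b / total_b,   # share of weights used per token
    }


moe = moe_footprint(total_b=35, active_b=3)    # Qwen3.5-35B-A3B
dense = moe_footprint(total_b=3, active_b=3)   # a 3B dense model

print(f"MoE:   ~{moe['memory_gb']:.0f} GB, {moe['active_fraction']:.0%} of weights active per token")
print(f"Dense: ~{dense['memory_gb']:.1f} GB, {dense['active_fraction']:.0%} of weights active per token")
```

Both models do roughly the same amount of work per token, but the MoE needs more than ten times the memory to hold the experts that are not currently selected.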

Further reading: A Visual Guide to Mixture of Experts · MoE LLMs: Key Concepts (Neptune.ai) · NVIDIA MoE Blog

Community Fine-Tunes and Variants

Beyond official releases, a vibrant community creates derivative models. These suffixes tell you what was done:

Common Derivative Suffixes

Suffix	Meaning	Example
-distilled / -Distill	Smaller model trained to mimic a larger "teacher" model	`DeepSeek-R1-Distill-Qwen-32B`
-abliterated	Safety refusal behavior surgically removed post-training	`Llama-3.2-abliterated`
-uncensored	Trained on unfiltered data to remove guardrails	`Dolphin-Mixtral-8x7B`
-reasoning	Optimized for chain-of-thought reasoning	`Phi-4-reasoning`
-LoRA	Fine-tuned with Low-Rank Adaptation (adapter weights only)	Various community models

Key Community Contributors

Name	Role	Known For
bartowski	GGUF quantizer	Most prolific quantizer on HuggingFace — multiple quant levels for every major release
unsloth (Daniel Han)	Fine-tuning framework + quantizer	Dynamic 2.0 quantization with per-layer optimization, 2-5x faster fine-tuning
Nous Research (Teknium)	Fine-tuning lab	Hermes series — premium fine-tunes with minimal content filtering
Eric Hartford	Fine-tuner	Dolphin uncensored model family
TheBloke	GGUF/GPTQ quantizer	Pioneer of community quantization (less active since 2024, bartowski inherited the role)
mlx-community	MLX converters	Pre-converted models for Apple Silicon users

Distillation Explained

Distillation is a technique where a smaller "student" model is trained to replicate a larger "teacher" model's outputs. The most famous example is DeepSeek-R1-Distill-Qwen-32B: a Qwen 2.5 32B model fine-tuned on roughly 800,000 samples, mostly chain-of-thought reasoning traces, generated with DeepSeek-R1 (671B). The result outperforms OpenAI o1-mini on multiple benchmarks despite being ~20x smaller than its teacher.

When you see "-Distill" in a name, it means: this model learned its skills from a bigger model, not just from raw data.

Further reading: Abliteration Explained (HuggingFace Blog) · DeepSeek-R1 Distilled Models · LoRA vs QLoRA (Modal) · Unsloth Dynamic 2.0 GGUFs · bartowski on HuggingFace

The 2026 Model Landscape

The open-weight ecosystem moves fast. Here's where the major families stand as of April 2026.

Gemma 4 (Google) — Apache 2.0

Natively multimodal across all sizes. The 26B MoE achieves near-31B quality with only 4B active parameters.

Model	Params	Architecture	Context	Modalities
Gemma 4 E2B	2.3B	Dense	128K	Text, Image, Video, Audio
Gemma 4 E4B	4.5B	Dense	128K	Text, Image, Video, Audio
Gemma 4 26B-A4B	26B total / 4B active	MoE	256K	Text, Image, Video
Gemma 4 31B	31B	Dense	256K	Text, Image, Video

Best for: Multimodal tasks at any size. The E4B is remarkable — audio, video, and image understanding at 4.5B parameters.

Qwen 3.5 (Alibaba) — Apache 2.0

The widest size range of any model family. Features hybrid thinking/non-thinking mode and a new Gated DeltaNet architecture.

Model	Params	Architecture	Context
Qwen3.5-0.8B	0.8B	Dense	262K
Qwen3.5-4B	4B	Dense	262K
Qwen3.5-9B	9B	Dense	262K
Qwen3.5-27B	27B	Dense	262K
Qwen3.5-35B-A3B	35B / 3B active	MoE	262K
Qwen3.5-122B-A10B	122B / 10B active	MoE	262K
Qwen3.5-397B-A17B	397B / 17B active	MoE	262K

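Best for: Versatility. 201 languages, strong coding (Qwen2.5-Coder), and a 35B-A3B MoE that computes like a 3B model. Note that it still needs roughly 20GB of memory at Q4_K_M, so running it on an 8GB GPU means offloading most of the weights to system RAM. The most popular base for community fine-tunes.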

Llama 4 (Meta) — Llama Community License

Meta's first MoE generation. Scout's 10M token context window is industry-leading.

Model	Params	Architecture	Context
Llama 4 Scout	109B / 17B active	MoE (16 experts)	10M
Llama 4 Maverick	400B / 17B active	MoE (128 experts)	1M
Llama 4 Behemoth	~2T / 288B active	MoE (16 experts)	TBD (preview)

Best for: Long context use cases. Per Meta, Scout fits on a single H100 GPU when quantized to INT4, and it supports a 10-million-token window.

Other Notable Families

Family	Key Model	Params	Standout Feature
DeepSeek	R1-Distill-Qwen-32B	32B	Best local reasoning via distillation
Phi-4 (Microsoft)	Phi-4-reasoning	14B	Beats the 671B DeepSeek-R1 on some math benchmarks
GLM-5 (Zhipu AI)	GLM-5	744B / 40B active	MIT license, trained without NVIDIA chips
Mistral	Mistral Large 3	675B / 41B active	Apache 2.0, strong multilingual
Hermes 4 (Nous)	Hermes 4 405B	405B	Minimal content filtering, strong reasoning
MiniMax	M2	229B / 10B active	$0.26/M input — cheapest frontier-class API

Trends Defining 2026

MoE everywhere. Almost every major release uses Mixture of Experts. The pattern: massive total parameters for knowledge, small active parameters for speed.

Hybrid reasoning. Models like Qwen 3.5 can toggle between fast responses and deep chain-of-thought reasoning in a single model. No separate "thinking" variant needed.

Distillation economy. DeepSeek-R1 proved you can get 80%+ of frontier reasoning in a 7-32B model. Everyone is distilling now.

Context windows keep growing. Llama 4 Scout: 10M tokens. Qwen 3.5: 262K native. Gemma 4: 256K.

The landscape changes quickly — check LMSYS Chatbot Arena for current rankings, and browse OpenRouter or the Ollama library to see what the community is actually using.

Further reading: Gemma 4 Announcement (Google Blog) · Qwen 3.5 on GitHub · Llama 4 Models (Meta) · DeepSeek Complete Guide (BentoML) · GLM-5 Guide · Hermes 4 (Nous Research)

How to Read a Hugging Face Model Card

Hugging Face is where most models live. Here's what to look for on a model page.

Repository Name

Format: organization/model-name

google/gemma-4-4b-it → Official Google release, Gemma 4, 4B params, instruction-tuned
bartowski/Qwen3.5-27B-GGUF → Community GGUF quantization by bartowski
unsloth/DeepSeek-R1-Distill-Llama-8B → Unsloth's optimized version

Key Files

File	What It Is
`README.md`	Model card — architecture, benchmarks, usage, license
`config.json`	Architecture blueprint (layers, vocab size, attention heads)
`model.safetensors`	The actual weights (may be sharded: `model-00001-of-00003.safetensors`)
`tokenizer.json`	Tokenizer definition
`generation_config.json`	Default generation settings (temperature, top_p)
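If you download `config.json`, a few lines of code are enough to pull out the fields that matter for sizing. The sample below is a made-up, abbreviated config for illustration only; the key names are ones commonly seen in Hugging Face Transformers configs, but they vary by architecture, so missing keys are reported as `None` rather than treated as errors.

```python
import json

# Hypothetical, abbreviated config.json content (not from a real model).
SAMPLE_CONFIG = """
{
  "architectures": ["ExampleForCausalLM"],
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "vocab_size": 128000,
  "max_position_embeddings": 131072,
  "torch_dtype": "bfloat16"
}
"""

KEYS_OF_INTEREST = (
    "architectures", "num_hidden_layers", "num_attention_heads",
    "hidden_size", "vocab_size", "max_position_embeddings", "torch_dtype",
)


def summarize_config(config_text: str) -> dict:
    """Return the architecture fields worth checking before a download."""
    config = json.loads(config_text)
    return {key: config.get(key) for key in KEYS_OF_INTEREST}


for key, value in summarize_config(SAMPLE_CONFIG).items():
    print(f"{key}: {value}")
```

In practice you would read the text from the downloaded file instead of the inline sample; `max_position_embeddings` is where many configs record the context length.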

What to Check Before Downloading

License: Apache 2.0 is among the most permissive. The Llama Community License requires a separate license from Meta for products with more than 700 million monthly active users. Some models restrict commercial use entirely.
Parameter count and architecture: Dense or MoE? How many active parameters?
Context length: How much text can the model process at once?
Quantization available: Check if bartowski or unsloth have GGUF versions in separate repos.
Benchmark scores: Compare against similar-sized models for your use case (MMLU for general knowledge, HumanEval for coding, GSM8K for math).

Finding the Right Variant

If the official repo is google/gemma-4-31b-it (safetensors, full precision), you'll find quantized versions at:

bartowski/gemma-4-31B-it-GGUF — Standard GGUF quantizations
unsloth/gemma-4-31B-it-GGUF — Dynamic quantization variants
mlx-community/gemma-4-31B-it-MLX — Apple Silicon format

Decision Framework: Finding the Right Model

There's no single "best model" for a given hardware setup — it depends on your task, your quality expectations, and how the model was trained, not just parameter count. The landscape changes quickly and new models regularly reshuffle the rankings. Rather than prescribing specific models, here's a framework for how to research and evaluate your options.

Step 1: Know Your Hardware Limits

Your RAM determines the maximum model size you can load. This table shows approximate upper bounds at Q4_K_M quantization:

Your Setup	Approximate Max Size (Q4_K_M)	Where to Explore
8GB RAM	~7B dense, or small MoE	Ollama library — filter by size
16GB RAM / Mac	~14B dense	LM Studio Discover — browse by hardware compatibility
32GB Mac	~32B dense	HuggingFace Models — check model cards for RAM requirements
64GB+ Mac	70B+ dense, large MoE	OpenRouter — try models via API before downloading
NVIDIA 8-12GB VRAM	~9B dense	Ollama library or vLLM with AWQ
NVIDIA 24GB VRAM	~27B dense	Community benchmarks at LocalLLM.in

These are rough guidelines — actual requirements depend on context length, batch size, and the specific model architecture. MoE models need RAM for their full parameter count even though they only activate a fraction per token.

Step 2: Explore What the Community Is Using

The best way to find the right model is to see what others with similar hardware and use cases are running. Here are the best places to research:

Ollama Model Library — Browse popular models, see download counts, and try them with one command. The tags show available sizes and quantizations.
Hugging Face Trending Models — See what's new and popular. Read model cards for benchmarks, hardware requirements, and community feedback.
OpenRouter — Try models via API before committing to a local download. Great for comparing quality across families before choosing one to run locally.
LM Studio — Visual model browser that shows hardware compatibility. Good for beginners exploring what fits their system.
LMSYS Chatbot Arena — Community-voted rankings across hundreds of models. Useful for comparing quality across model families.
LocalLLM.in — Benchmarks specifically for local inference, organized by VRAM tier.

As of April 2026, some of the most popular open-weight model families include Qwen 3.5, Gemma 4, DeepSeek (V3 and R1 distills), GLM-5, MiniMax M2, Kimi K2.5, and Phi-4 — but this list shifts regularly as new models release. Don't take any single recommendation as definitive. Try a few models yourself and evaluate quality for your specific tasks.

Step 3: Which Quantization?

The ladder, from minimum to maximum quality:

You're very memory-constrained → Q3_K_M (noticeable quality loss, but it runs)
Standard recommendation → Q4_K_M (92% quality, fits most setups)
You have extra RAM → Q5_K_M (near-imperceptible loss)
You have plenty of RAM → Q6_K or Q8_0 (effectively lossless)

General rule: within the same model family, prefer a larger model at lower-precision quantization over a smaller model at higher precision. A 14B at Q4_K_M usually beats a 7B at Q8_0.
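The ladder can be turned into a small selection helper. This is a sketch under stated assumptions: file sizes are scaled linearly from this guide's approximate 7B figures, and the 2 GB headroom for the context window and runtime overhead is an arbitrary illustrative default, not a measured requirement. `pick_quant` is a hypothetical function name.

```python
# Approximate 7B file sizes in GB from this guide, highest quality first.
LADDER_7B_GB = [
    ("Q8_0", 7.0), ("Q6_K", 5.5), ("Q5_K_M", 4.8), ("Q4_K_M", 3.8), ("Q3_K_M", 3.1),
]


def pick_quant(params_billions: float, memory_gb: float, headroom_gb: float = 2.0):
    """Return the highest-quality level whose estimated file fits, else None."""
    budget = memory_gb - headroom_gb
    for level, size_7b in LADDER_7B_GB:
        estimated_gb = size_7b * params_billions / 7  # linear scaling assumption
        if estimated_gb <= budget:
            return level
    return None


print(pick_quant(7, 8))     # 7B model on an 8 GB machine
print(pick_quant(14, 12))   # 14B model with 12 GB available
print(pick_quant(70, 16))   # too large at every level on the ladder
```

A `None` result means the model does not fit at any level on the ladder, which is the cue to step down a size tier rather than drop to Q2_K.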

Step 4: Which Format?

Your Tool	Format to Download
Ollama	GGUF (or let Ollama auto-convert)
LM Studio	GGUF or MLX (Mac)
llama.cpp	GGUF
vLLM	Safetensors (or AWQ for GPU quantization)
Fine-tuning	Safetensors (always start with full precision)
Apple Silicon native	MLX

Quick-Start: Trying Models with Ollama

The fastest way to experiment is with Ollama — one command to download and run. Here are some examples to get started, but browse the full Ollama library to see what's currently popular:

# See which models are already downloaded (browse ollama.com/library for what's available)
ollama list

# Try a small model (fits 8GB+ RAM)
ollama run gemma4:4b

# Try a medium model (fits 16GB+ RAM)
ollama run qwen3.5:9b

# Try a larger model (fits 32GB+ RAM)
ollama run qwen3.5:27b

# Specify a quantization level (tag names vary by model; check the model's Tags page)
ollama run qwen3.5:9b-q5_K_M

# Confirm what Ollama downloaded
ollama list

The Ollama library, LM Studio's model browser, and OpenRouter's model list are all good starting points for discovering what's available. Try a few models at your hardware tier, compare the output quality for your specific use case, and see what works best for you.

Glossary

Quick reference for every abbreviation you'll encounter in model names.

Term	Meaning
B	Billions of parameters
M	Millions of parameters
IT / Instruct	Instruction-tuned — fine-tuned to follow prompts
Base	Pretrained only — raw text completion
Chat	Optimized for multi-turn conversation
GGUF	GPT-Generated Unified Format — single-file format for local inference
Safetensors	HuggingFace's secure tensor serialization
Q4_K_M	4-bit K-quant, medium variant (some tensors at higher precision), the mainstream default
Q8_0	8-bit quantization — near-lossless
F16 / FP16	16-bit floating point — half precision
BF16	Brain Float 16 — default training precision
AWQ	Activation-Aware Weight Quantization — GPU-optimized 4-bit
GPTQ	GPT Quantization — early GPU quantization method
EXL2	ExLlamaV2 format — mixed bit-width GPU quantization
MLX	Apple's ML framework for Apple Silicon
MoE	Mixture of Experts — only a fraction of params active per token
Dense	All parameters active on every token
LoRA	Low-Rank Adaptation — efficient fine-tuning method
QLoRA	Quantized LoRA — fine-tuning with 4-bit base model
DPO	Direct Preference Optimization — alignment technique
RLHF	Reinforcement Learning from Human Feedback
Distilled	Trained to mimic a larger model's outputs
Abliterated	Safety refusals surgically removed
VL	Vision-Language — supports image input
A_B suffix	Active parameters in MoE (e.g., A4B = 4B active)
imatrix	Importance matrix — used during quantization for better quality
K-quant	Block-wise quantization using super-blocks with quantized scales, with mixed bit-widths across tensors
bpw	Bits per weight — average precision across the model

This guide is part of a series on local AI inference. For tool comparisons and hardware recommendations, see Local LLM Inference in 2026: The Complete Guide. For Apple Silicon-specific advice, see Best Mac Mini for Local LLMs.


