
Apple Silicon LLM Inference Optimization: The Complete Guide to Maximum Performance

TL;DR: MLX is 20-87% faster than llama.cpp for generation on Apple Silicon (under 14B params). Use Ollama 0.19+ with the MLX backend for 93% faster decode with zero config. Q4_K_M is the sweet spot quantization (3.3% quality loss, 75% size reduction). On a 32GB Mac, top picks include Qwen 3.5 9B (daily driver), DeepSeek R1 Distill 14B (reasoning), Qwen 3.5 35B-A3B (MoE), and OpenAI gpt-oss-20b — but the "best" model depends on your use case, context length needs, and quality tolerance. Memory bandwidth is your bottleneck — not compute, not VRAM, not GPU cores. This guide covers every optimization that matters.

I run a 32GB M4 Mac Mini as my local inference box. After weeks of benchmarking different engines, quantization levels, models, and optimization techniques, I've compiled everything into one reference. The Apple Silicon inference ecosystem has matured dramatically in 2026 — MLX is no longer experimental, Ollama ships an MLX backend, and vLLM has two competing Apple Silicon ports. The performance gap between local and cloud is narrowing fast for mid-size models.

This guide is specifically about optimization — squeezing maximum tokens per second at maximum quality from Apple Silicon hardware. If you need a broader overview of inference tools and hardware, see my local LLM inference guide and Mac Mini buying guide.

The Bandwidth Bottleneck

LLM token generation is memory-bandwidth-bound, not compute-bound. Every generated token requires reading the entire model's weights from memory once. This means your tokens-per-second ceiling is a simple formula:

Max tok/s = Memory Bandwidth (GB/s) ÷ Model Size in Memory (GB)

Applied to M4-series chips:

| Chip | Bandwidth | 7B Q4 (~4GB) | 14B Q4 (~8GB) | 32B Q4 (~18GB) | 70B Q4 (~40GB) |
|---|---|---|---|---|---|
| M4 | 120 GB/s | ~30 tok/s | ~15 tok/s | ~6.7 tok/s | N/A |
| M4 Pro | 273 GB/s | ~68 tok/s | ~34 tok/s | ~15 tok/s | ~7 tok/s |
| M4 Max | 546 GB/s | ~136 tok/s | ~68 tok/s | ~30 tok/s | ~14 tok/s |

Real-world numbers hit 60-80% of theoretical due to KV cache reads, attention computation, and kernel overhead. But the relationship holds: quantization is a direct multiplier. Going from FP16 to Q4 gives you 4x throughput because you move 4x less data per token.
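To make the formula concrete, here's a rough estimator in Python (the 0.7 efficiency factor is an assumption reflecting the 60-80% real-world range above):

def est_decode_tok_s(bandwidth_gb_s, model_size_gb, efficiency=0.7):
    # Ceiling = bandwidth / bytes read per token, scaled by a real-world efficiency factor
    return efficiency * bandwidth_gb_s / model_size_gb

print(est_decode_tok_s(273, 8.0))   # M4 Pro + 14B Q4 (~8 GB): ~24 tok/s
print(est_decode_tok_s(546, 40.0))  # M4 Max + 70B Q4 (~40 GB): ~10 tok/s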

This is why buying more GPU cores barely helps. The M4 Pro's 16 GPU cores vs the base M4's 10 cores is a 60% increase in compute — but its 2.3x increase in memory bandwidth is what actually drives the 2x+ inference speedup.

MLX vs llama.cpp vs Ollama: Benchmarks

The Apple Silicon inference landscape shifted in 2026. MLX went from experimental to the fastest option for most workloads.

Generation Speed (decode tok/s)

M4 Max (128GB):

| Model | Quant | MLX (tok/s) | llama.cpp (tok/s) | MLX Advantage |
|---|---|---|---|---|
| Qwen3-0.6B | 4-bit | 525.5 | 281.5 | +87% |
| Llama-3.2-1B | 4-bit | 461.9 | 331.3 | +39% |
| Qwen3-4B | 4-bit | 159.0 | 118.2 | +35% |
| Qwen3-8B | 4-bit | 93.3 | 76.9 | +21% |
| Qwen2.5-27B | 4-bit | ~14 | ~14 | Tied |

Source: Groundy benchmarks

Key pattern: MLX leads by 20-87% for models under ~14B. The advantage collapses at 27B+ where memory bandwidth saturates and both frameworks hit the same ceiling.

Ollama 0.19 MLX Backend vs llama.cpp Backend

Ollama 0.19 added an MLX backend that auto-activates on Macs with 32GB+ RAM:

| Metric | llama.cpp backend (0.18) | MLX backend (0.19) | Improvement |
|---|---|---|---|
| Prefill | 1,147 tok/s | 1,804 tok/s | +57% |
| Decode | 57.8 tok/s | 111.4 tok/s | +93% |
| Total duration | 4.2s | 2.3s | -45% |

Model: Qwen3.5-35B-A3B on M4 Max 64GB. Source: DEV Community

To enable it explicitly, update to Ollama 0.19+ and set the environment variable:

export OLLAMA_MLX=1
ollama run qwen3:14b

The MLX backend requires 32GB+ unified memory; Macs with 16GB or 24GB will not activate it.
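To measure the difference on your own machine, Ollama's local REST API reports prefill and decode timing for every request. A minimal sketch (run it on 0.18, then again with OLLAMA_MLX=1 on 0.19+):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3:14b", "prompt": "Explain KV caching in two sentences.", "stream": False},
).json()

# eval_count/eval_duration cover decode, prompt_eval_* cover prefill; durations are in nanoseconds
decode_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")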

The Prefill Problem

MLX has a critical weakness: it performs full prefill before emitting any tokens. This means time-to-first-token (TTFT) rises linearly with input length.

Benchmark at 8.5K context (M1 Max, Qwen3.5-35B-A3B):

| Phase | MLX | GGUF (llama.cpp) |
|---|---|---|
| Prefill time | 49.4s | 37.8s |
| Prefill as % of total | 94% | 87% |
| Effective throughput | 3 tok/s | Faster overall |

Source: famstack.dev

The UI reported 51 tok/s for MLX generation, but effective wall-clock throughput was only 3 tok/s because 94% of time was spent prefilling.

When to Use Which

| Scenario | Winner | Why |
|---|---|---|
| Short input, long output (writing, code gen) | MLX | Higher sustained generation tok/s |
| Long input, short output (RAG, Q&A, classification) | llama.cpp (GGUF) | Faster prefill, lower TTFT |
| Easiest setup | Ollama 0.19+ | Auto-MLX on 32GB+, zero config |
| Model doesn't fit in memory | llama.cpp | CPU/GPU layer splitting (no MLX equivalent) |
| Multi-user serving | vllm-mlx | Continuous batching, 2-3.4x scaling |

Related: Local LLM Inference in 2026: The Complete Guide (a broader overview of 10 inference tools, hardware at every budget, and the open-weight model ecosystem).

Quantization Deep Dive

K-Quants: Why They Exist

K-quants (Q4_K_M, Q5_K_M, Q6_K) replaced legacy quantization (Q4_0, Q4_1) in llama.cpp. The difference is architectural:

| Feature | Legacy (Q4_0) | K-quants (Q4_K_M) |
|---|---|---|
| Block structure | Flat, single-level | Hierarchical super-blocks (256 values) |
| Scale quantization | One scale per block | Quantized scales + mins per sub-block |
| Bit allocation | Uniform across all layers | Variable precision by layer sensitivity |
| File size (7B) | 3.50 GB | 3.80 GB |
| Quality (ppl delta) | +0.2499 | +0.0535 |

K-quants deliver 3-4x less perplexity increase at the same file size. The llama.cpp maintainers now recommend Q3_K_M over Q4_0 — the 3-bit K-quant is better than the 4-bit legacy format.

The S/M/L suffixes control metadata overhead: S (smaller, faster), M (balanced — use this), L (more metadata, best reconstruction).

Quality Comparison by Quantization Level

Measured on Llama-3.1-8B-Instruct, WikiText-2 perplexity:

| Quant | Bits/Weight | Perplexity | Delta from FP16 | Quality Assessment |
|---|---|---|---|---|
| FP16 | 16.0 | 7.32 | 0% | Reference |
| Q8_0 | 8.5 | 7.33 | +0.14% | Essentially lossless |
| Q6_K | 6.6 | 7.35 | +0.41% | Virtually indistinguishable |
| Q5_K_M | 5.7 | 7.40 | +1.09% | Quality sweet spot |
| Q4_K_M | 4.9 | 7.56 | +3.28% | Best size/quality tradeoff |
| Q4_K_S | 4.6 | 7.62 | +4.10% | Noticeable on sensitive tasks |
| Q3_K_M | 3.9 | 7.96 | +8.74% | Meaningful degradation |
| Q3_K_S | 3.4 | 8.96 | +22.40% | Significant quality loss |

Source: arXiv 2601.14277

Practical guide:

  • Have the RAM? Use Q5_K_M — 1% quality loss is the best tradeoff
  • Tight on RAM? Use Q4_K_M — 3% loss is imperceptible for chat/coding
  • Want lossless? Q8_0 is 0.14% loss — essentially identical to FP16
  • Desperate for space? Q3_K_M is the floor — below this, quality falls off a cliff

GGUF vs MLX Format vs AWQ vs GPTQ

| Format | Apple Silicon Support | Speed on Mac | Quality | Ecosystem |
|---|---|---|---|---|
| GGUF | Full native (Metal) | Fastest (llama.cpp) | Excellent (K-quants) | Ollama, LM Studio, llama.cpp |
| MLX native | Full native | Fastest (MLX) | Excellent | mlx-lm, Ollama 0.19+ |
| AWQ | No (CUDA only) | N/A | Best accuracy retention | NVIDIA GPUs only |
| GPTQ | No (CUDA only) | N/A | Good | NVIDIA GPUs only |

GGUF is the only viable format on Apple Silicon for llama.cpp/Ollama. MLX has its own format (safetensors) that's 20-30% more memory-efficient due to zero-copy unified memory. MLX can read GGUF files, but only Q4_0, Q4_1, and Q8_0 — all other quants get cast to FP16, defeating the purpose.

Bottom line: Use GGUF for Ollama/llama.cpp. Use MLX-native models (from mlx-community on HuggingFace) for mlx-lm. Ignore AWQ and GPTQ entirely — they're NVIDIA-only.
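For the mlx-lm path, loading an mlx-community conversion takes only a few lines. A minimal sketch (the repo name is illustrative; any 4-bit mlx-community model works the same way):

# pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")  # illustrative repo name
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about unified memory."}],
    tokenize=False,
    add_generation_prompt=True,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)  # verbose prints tok/s stats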

RAM Requirements by Model Size

The formula: Model RAM (GB) = Parameters (in billions) × bits per weight ÷ 8. Keep model weights under 60% of total unified memory to leave room for macOS, KV cache, and the inference runtime.

| Parameters | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | FP16 |
|---|---|---|---|---|---|
| 1B | 0.6 GB | 0.7 GB | 0.8 GB | 1.1 GB | 2.0 GB |
| 3B | 1.8 GB | 2.1 GB | 2.5 GB | 3.2 GB | 6.0 GB |
| 7-8B | 4.6 GB | 5.3 GB | 6.1 GB | 8.0 GB | 15.0 GB |
| 13-14B | 8.0 GB | 9.3 GB | 10.7 GB | 13.8 GB | 26.3 GB |
| 22B | 12.5 GB | 14.6 GB | 16.9 GB | 21.7 GB | 41.2 GB |
| 27B | 15.4 GB | 17.9 GB | 20.7 GB | 26.7 GB | 50.6 GB |
| 32B | 18.2 GB | 21.2 GB | 24.5 GB | 31.6 GB | 60.0 GB |
| 70B | 39.9 GB | 46.4 GB | 53.6 GB | 69.1 GB | 131.3 GB |
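
A quick way to apply the formula and the 60% rule yourself (bits per weight come from the quantization table earlier):

def model_ram_gb(params_billions, bits_per_weight):
    # Model RAM (GB) = parameters (billions) x bits per weight / 8
    return params_billions * bits_per_weight / 8

def fits(total_ram_gb, params_billions, bits_per_weight, weight_budget=0.60):
    return model_ram_gb(params_billions, bits_per_weight) <= weight_budget * total_ram_gb

print(model_ram_gb(13, 4.9))  # ~8.0 GB for a 13B model at Q4_K_M
print(fits(32, 32, 5.7))      # False: 32B at Q5_K_M blows the 60% budget on a 32GB Mac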

What Fits at Each RAM Tier

| RAM | Max Model (Q4_K_M) | Max Model (Q8_0) | Sweet Spot |
|---|---|---|---|
| 8GB | 3B | 1B | Phi-3.5-mini 3B Q4 |
| 16GB | 13-14B | 7-8B | Llama 3.1 8B Q4 |
| 24GB | 22B | 13-14B | Qwen 3 14B Q5 |
| 32GB | 27B (tight) | 22B | Qwen 3 14B Q8 or 27B Q4 |
| 48GB | 70B (tight) | 32B | Qwen 3 32B Q5 or 70B Q4 |
| 64GB | 70B (comfortable) | 70B (tight) | Llama 3.3 70B Q4 |
| 128GB | 70B FP16 | 100B+ | 70B Q8 with massive context |

What Fits on 32GB

Usable memory budget: ~20GB for model weights (60% rule).

| Model | Quant | File Size | Est. tok/s (M4) | Context Headroom | Verdict |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 4.6 GB | 25-30 | Massive (64K+) | Fast, room for everything |
| Qwen 3.5 9B | Q5_K_M | 6.5 GB | 18-28 | Large (64K) | Strong daily driver |
| Qwen 3.5 9B | Q4_K_M | 5.5 GB | 22-32 | Massive (64K+) | Great all-rounder |
| DeepSeek R1 Distill 14B | Q4_K_M | 8.0 GB | 15-20 | Large | Best reasoning at this tier |
| OpenAI gpt-oss-20b | Q4_K_M | 11.5 GB | 10-18 | Good (32K) | Strong general purpose |
| Qwen 2.5 Coder 14B | Q4_K_M | 8.0 GB | 15-25 | Large | Best coding model |
| Gemma 3 12B | Q4_K_M | 7.0 GB | 20-30 | Massive (128K) | Long-context specialist |
| Gemma 4 26B-A4B (MoE) | Q4 | ~15 GB | 15-30 | Moderate (16K) | MoE with vision |
| Phi-4 14B | Q4_K_M | 8.0 GB | 15-25 | Moderate (16K) | Math/analytical reasoning |
| Qwen 3.5 27B | Q4_K_M | 15.4 GB | 6-10 | Limited (8K) | Tight fit, short context |
| Qwen 3.5 35B-A3B (MoE) | Q4 | ~19 GB | 15-30 | Limited (8-16K) | MoE with vision, top pick |
| Any 70B | Any | 40+ GB | | | Does not fit |

| Hugging Face repo | Downloads | Likes | Params | License |
|---|---|---|---|---|
| Qwen/Qwen3.5-9B | 5.1M | 1.2K | 9B | Apache 2.0 |
| Qwen/Qwen3.5-35B-A3B | 3.5M | 1.4K | 35B (3B active) | Apache 2.0 |
| openai/gpt-oss-20b | 5.9M | 4.5K | 20B | Apache 2.0 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 604.5K | 622 | 14B | MIT |

My recommended starting points for 32GB (but see Choosing a Model below):

  • Primary: Qwen 3.5 9B Q5_K_M — general use, coding, writing, multimodal
  • Reasoning: DeepSeek R1 Distill 14B Q4_K_M — chain-of-thought problems
  • Speed: Llama 3.1 8B Q4_K_M — quick tasks, agent chains
  • Maximum capability: Qwen 3.5 35B-A3B Q4 — MoE, 35B quality at near-8B speed
  • Alternative: OpenAI gpt-oss-20b Q4_K_M — strong general purpose, Apache 2.0

Related: Best Mac Mini for Running Local LLMs in 2026 (every Mac Mini config compared with current pricing, model compatibility tables, and used market analysis).

Choosing a Model: It Depends

There is no universally "best" model. The right choice depends on factors that vary by person and use case. Before defaulting to whatever is trending on HuggingFace, ask yourself:

What matters most to you?

| Priority | Best Options | Why |
|---|---|---|
| Raw quality per token | Qwen 3.5 27B Q4, gpt-oss-20b Q4 | Larger dense models = better reasoning |
| Speed + quality balance | Qwen 3.5 9B Q5_K_M | 9B is the sweet spot for 32GB |
| Chain-of-thought reasoning | DeepSeek R1 Distill 14B | Trained specifically for reasoning |
| Coding assistance | Qwen 2.5 Coder 14B | Fine-tuned on code, strong benchmarks |
| Long context (RAG, docs) | Gemma 3 12B (128K context) | Largest native context window |
| Vision + text (multimodal) | Qwen 3.5 9B, Gemma 4 26B-A4B | Both handle image input natively |
| Maximum capability on 32GB | Qwen 3.5 35B-A3B (MoE) | 35B knowledge, 3B compute cost |
| Open license required | gpt-oss-20b, Qwen 3.5 (Apache 2.0) | Fully permissive for commercial use |
| Fastest possible | Llama 3.1 8B Q4_K_M | Smallest footprint, highest tok/s |

The model landscape moves fast. The Qwen 3.5 family launched in February 2026 and already has models with 5M+ downloads. Gemma 4 arrived in March 2026 with a compelling MoE variant. OpenAI released their first open-weight model (gpt-oss-20b) in August 2025. By the time you read this, newer options may exist.

How to evaluate a new model for your setup:

  1. Check HuggingFace for downloads, community ratings, and GGUF availability
  2. Verify it fits your RAM budget using the RAM table above — model weights under 60% of total memory (the sketch after this list automates checks 1 and 2)
  3. Look for Unsloth Dynamic 2.0 GGUFs or mlx-community conversions for best Apple Silicon performance
  4. Test on YOUR use cases — benchmark scores do not always correlate with subjective quality for your specific tasks
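Here's a sketch of checks 1 and 2 using huggingface_hub (the repo ID is just an example, one of the Unsloth GGUF repos listed later; swap in whatever you're evaluating):

from huggingface_hub import HfApi

repo = "unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF"  # example repo; substitute the model you're evaluating
info = HfApi().model_info(repo, files_metadata=True)
print(f"{repo}: {info.downloads:,} downloads, {info.likes} likes")

budget_gb = 32 * 0.60  # 60% rule on a 32GB Mac
for f in sorted(info.siblings, key=lambda s: s.size or 0):
    if f.rfilename.endswith(".gguf"):
        size_gb = (f.size or 0) / 1e9
        print(f"{f.rfilename}: {size_gb:.1f} GB, fits: {size_gb <= budget_gb}")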

| Hugging Face repo | Downloads | Likes | Params | License |
|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | 1.3M | 580 | 26B (4B active) | Gemma |
| Qwen/Qwen2.5-Coder-14B-Instruct | 960.8K | 145 | 14B | Apache 2.0 |
| meta-llama/Llama-3.1-8B-Instruct | 9.2M | 5.7K | 8B | Llama 3.1 |

MoE vs Dense Models

The critical misconception: MoE does NOT save memory. All expert weights must reside in RAM — the router needs access to every expert to make routing decisions. MoE saves compute, not memory.

What MoE does give you on Apple Silicon:

  • Faster token generation — fewer active parameters per token means less computation
  • Higher quality per RAM dollar — a 30B MoE with 3B active params gives you 30B-class knowledge at near-3B speed
  • Same memory footprint — Mixtral 8x7B (46B total) needs ~26GB at Q4, same as a dense 46B model

MoE Models on 32GB Mac

| Model | Total Params | Active Params | Size (Q4) | Fits 32GB? | Est. tok/s | Vision? |
|---|---|---|---|---|---|---|
| Qwen 3.5 35B-A3B | 35B | 3.3B | ~19 GB | Yes (tight) | 15-30 | Yes |
| Gemma 4 26B-A4B | 26B | 4B | ~15 GB | Yes | 15-30 | Yes |
| Qwen 3 30B-A3B | 30.5B | 3.3B | ~17 GB | Yes (tight) | 15-30 | No |
| Mixtral 8x7B | 46B | 12.9B | ~26 GB | Marginal | 5-10 | No |
| Qwen 3.5 122B-A10B | 122B | 10B | ~70 GB | No | | Yes |
| DeepSeek V3 | 671B | 37B | ~380 GB | No | | No |

The MoE landscape has expanded significantly in early 2026. Two standouts for constrained hardware:

  • Qwen 3.5 35B-A3B (3.5M downloads, 1.3K likes) — successor to Qwen 3 30B-A3B with multimodal vision support. Only 3.3B active parameters means near-8B generation speed despite 35B total knowledge. On M4 Max, expect 64-92 tok/s.
  • Gemma 4 26B-A4B (1.3M downloads) — Google's MoE entry with 4B active params. Fits more comfortably in 32GB than Qwen 3.5 35B-A3B (~15GB vs ~19GB), leaving more room for KV cache and context.

The memory bandwidth formula explains why MoE helps: all experts must sit in RAM, but each generated token only reads the weights of the experts the router activates (plus the shared layers), so per-token memory traffic looks closer to a 3-4B model than a 35B one. The lighter compute per token helps too, but the reduced weight traffic is what lifts generation speed.

vLLM on Apple Silicon

Two competing projects bring vLLM's serving capabilities to Mac:

vllm-mlx

Built on MLX. The more feature-complete option.

pip install vllm-mlx
vllm-mlx serve mlx-community/Qwen3-14B-4bit
  • OpenAI + Anthropic Messages API compatibility
  • MCP tool calling support
  • Multimodal (text, vision, audio, embeddings)
  • Continuous batching — 2-3.4x throughput with concurrent requests
  • 21-87% faster than llama.cpp

Source: github.com/waybarrios/vllm-mlx
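Because it speaks the OpenAI API, any OpenAI client can point at it. A minimal sketch assuming the serve command above is listening on vLLM's usual port 8000 (adjust if yours differs):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local server ignores the key
resp = client.chat.completions.create(
    model="mlx-community/Qwen3-14B-4bit",
    messages=[{"role": "user", "content": "Summarize the memory bandwidth bottleneck in one sentence."}],
)
print(resp.choices[0].message.content)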

vllm-metal

Official vLLM community plugin with native Metal kernels.

curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
  • PagedAttention with Metal kernels
  • Full vLLM scheduler integration
  • Text-only (no multimodal)
  • No published benchmarks yet

Source: github.com/vllm-project/vllm-metal

Use vllm-mlx if you need multi-user serving from a Mac. Use Ollama for single-user inference.

Optimization Techniques

1. KV Cache Quantization

The KV cache stores attention states and grows linearly with context length. On a 14B model at 32K context, it can consume 4-8GB — a huge chunk of your 32GB budget.

llama.cpp supports KV cache quantization via flags:

llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0

| KV Cache Type | Memory Savings | Quality Impact | Recommendation |
|---|---|---|---|
| FP16 (default) | 1x | Baseline | Default |
| Q8_0 | 2x | Less than 1% degradation | Use this. Best tradeoff. |
| Q4_0 | 4x theoretical | Noticeable quality loss | Avoid — actually 92% slower at 64K context due to dequant overhead |

This is free performance. Q8 KV cache halves memory usage with negligible quality loss, letting you either run larger models or longer contexts.
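Where does the 4-8GB figure come from? KV cache size is layers x KV heads x head dim x context length x bytes per element, doubled for K and V. A quick sketch with illustrative dimensions for a ~14B GQA model (check the model's config.json for real values):

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # x2 for the separate K and V tensors; bytes_per_elem: 2 = FP16, 1 = Q8_0
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(kv_cache_gb(48, 8, 128, 32_768))                    # ~6.4 GB at FP16
print(kv_cache_gb(48, 8, 128, 32_768, bytes_per_elem=1))  # ~3.2 GB with a q8_0 cache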

2. Speculative Decoding

A small "draft" model generates candidate tokens; the large "target" model verifies them in a single forward pass. Output is mathematically identical — zero quality loss.

| Framework | Support | Speedup |
|---|---|---|
| MLX | Native (mlx-lm) | 2-3x |
| llama.cpp | -md / --model-draft flag | 2-3x |
| Ollama | Partial (via backends) | Varies |
| LM Studio | Beta | 2x |

Apple's own Recurrent Drafter research achieved 2.3x speedup on Metal. Typical acceptance rate is 60-80%.

# llama.cpp speculative decoding
llama-server -m qwen3-14b-q4.gguf -md qwen3-0.6b-q8.gguf

3. Flash Attention on Metal

Available through community implementations:

  • Metal FlashAttention 2.0 — 43-120% faster attention computation
  • mlx-mfa — Flash Attention for MLX
  • MLX built-in — Uses optimized Metal attention kernels (not branded Flash Attention, but similar benefits)

The main benefit: reducing attention memory from O(n^2) to O(n) in sequence length. Critical for long-context inference on memory-constrained machines.

4. Context Length vs Speed

Longer context = dramatically slower generation. At 40-50K tokens, expect 10x slower performance than short context.

Practical limits on 32GB (model weight budget ~20GB):

| Model (Q4) | Weights | Max Practical Context | tok/s (short context) | tok/s (long context) |
|---|---|---|---|---|
| Qwen 3 14B | 8 GB | 32K-64K | 15-25 | 5-10 |
| Qwen 3 30B-A3B | 17 GB | 8K-16K | 15-30 | 8-15 |
| Qwen 3 27B | 15 GB | 8K-16K | 6-10 | 3-5 |

With Q8 KV cache quantization, roughly double these context limits.


Best Models for Every Mac Configuration

These tables show strong options per tier — not the single "best" model. See Choosing a Model for how to decide.

M4 Base (16-32GB) — 120 GB/s

| Use Case | Model | Quant | RAM | tok/s |
|---|---|---|---|---|
| Daily driver | Qwen 3.5 9B | Q5_K_M | 6.5 GB | 18-28 |
| Reasoning | DeepSeek R1 Distill 14B | Q4_K_M | 8 GB | 15-20 |
| Coding | Qwen 2.5 Coder 14B | Q4_K_M | 8 GB | 15-25 |
| Speed | Llama 3.1 8B | Q4_K_M | 4.6 GB | 25-30 |
| Long context | Gemma 3 12B | Q4_K_M | 7 GB | 20-30 |
| General (20B) | OpenAI gpt-oss-20b | Q4_K_M | 11.5 GB | 10-18 |
| MoE | Gemma 4 26B-A4B | Q4 | ~15 GB | 15-30 |

M4 Pro (24-48GB) — 273 GB/s

| Use Case | Model | Quant | RAM | tok/s |
|---|---|---|---|---|
| Daily driver | Qwen 3.5 27B | Q4_K_M | 15 GB | 15-25 |
| Quality | Qwen 3.5 9B | Q8_0 | 10 GB | 30-45 |
| MoE | Qwen 3.5 35B-A3B | Q4 | 19 GB | 30-50 |
| Coding | Qwen 2.5 Coder 32B | Q4_K_M | 18 GB | 12-22 |
| General | OpenAI gpt-oss-20b | Q5_K_M | 14 GB | 15-25 |
| Max capability (48GB) | Llama 3.3 70B | Q4_K_M | 40 GB | 7-10 |

M4 Max (64-128GB) — 546 GB/s

| Use Case | Model | Quant | RAM | tok/s |
|---|---|---|---|---|
| Frontier local | Llama 3.3 70B | Q6_K | 54 GB | 10-15 |
| Quality 70B | Llama 3.3 70B | Q8_0 | 69 GB | 8-12 |
| Fast MoE | Qwen 3.5 35B-A3B | Q4 | 19 GB | 64-92 |
| Maximum quality | Qwen 3.5 27B | Q8_0 | 27 GB | 20-30 |
| Large MoE | Qwen 3.5 122B-A10B | Q4 | ~70 GB | 10-20 |

The M4 Chip Architecture

| Spec | M4 | M4 Pro | M4 Max |
|---|---|---|---|
| CPU Cores | 10 (4P + 6E) | 12-14 | 14-16 |
| GPU Cores | 8-10 | 16-20 | 32-40 |
| Neural Engine | 16-core, 38 TOPS | 16-core, 38 TOPS | 16-core, 38 TOPS |
| Max RAM | 16/24/32 GB | 24/48 GB | 36/48/64/128 GB |
| Memory Bandwidth | 120 GB/s | 273 GB/s | 546 GB/s |

Does the Neural Engine Help?

Not yet — but research is progressing. The 16-core Neural Engine (38 TOPS) is designed for vision, speech, and sensor data — not the large matrix operations in LLM inference. MLX does not use the ANE. Neither does llama.cpp. CoreML can theoretically route to ANE but imposes model size limits that make it impractical for 7B+ models.

However, the Orion project (March 2026) demonstrated the first open system for direct ANE programming, bypassing CoreML entirely via private APIs. It achieved 170+ tok/s for GPT-2 124M inference on M4 Max and documented 20 ANE hardware constraints (14 previously unknown). While not yet practical for large models, Orion shows the ANE has untapped potential for LLM workloads.

For now, the GPU is your inference engine. All production frameworks use Metal shaders on the GPU cores.

Note on M5: Apple's M5 chip introduced Neural Accelerators inside each GPU core specifically for matrix multiplication, providing up to 4x speedup for time-to-first-token. This is different from the standalone Neural Engine — it's integrated into the GPU pipeline. M5 memory bandwidth is 153 GB/s (28% higher than M4), yielding 19-27% faster token generation. MLX supports these accelerators natively.

Unsloth: Training Tool, Quantization Game-Changer

Unsloth is primarily a fine-tuning tool — it makes QLoRA training 2-5x faster with 50-70% less VRAM. It does NOT run inference. For inference, use Ollama/llama.cpp/MLX.

Apple Silicon training status: Not yet supported natively. Requires NVIDIA GPUs (CUDA). MLX training is listed as "coming soon."

Why Unsloth matters for you anyway: Dynamic 2.0 GGUFs.

Unsloth Dynamic 2.0 is an advanced quantization method that:

  • Analyzes each layer individually and picks the optimal quantization type per layer
  • Uses a calibration dataset of 1.5M+ tokens for conversational optimization
  • Reduces KL divergence by 5-8% compared to standard quantization at the same bit level
  • The resulting GGUFs run on any inference engine (llama.cpp, Ollama, LM Studio)

Download Unsloth Dynamic 2.0 GGUFs from HuggingFace for the best quality-at-size. You don't need Unsloth installed — just the model files. Top Unsloth GGUFs for Apple Silicon:

| Hugging Face repo | Downloads | Likes | Params | Quantization |
|---|---|---|---|---|
| unsloth/Qwen3.5-35B-A3B-GGUF | 1.4M | 794 | 35B MoE | Dynamic 2.0 |
| unsloth/Qwen3.5-9B-GGUF | 1.1M | 487 | 9B | Dynamic 2.0 |
| unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF | 92.4K | 133 | 14B | Dynamic 2.0 |
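
You can pull a single quant file straight from one of these repos with huggingface_hub (a sketch; the filename is illustrative, so list the repo's files to get the exact Q4_K_M name):

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF",
    filename="DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf",  # illustrative filename
)
print(path)  # point llama-server -m (or an Ollama Modelfile FROM line) at this path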

Related: Obsidian + Claude Code: The Complete Integration Guide (connect your local inference setup to Obsidian for an AI-powered knowledge base workflow).

Sources

Research Papers

  • arXiv: Native LLM Inference at Scale on Apple Silicon via vllm-mlx (2026)
  • arXiv: Comparative Study of MLX, Ollama, llama.cpp on Apple Silicon (2025)
  • arXiv: Profiling LLM Inference on Apple Silicon: A Quantization Perspective (2025)
  • arXiv: Which Quantization Should I Use? Unified llama.cpp Evaluation (2026)
  • arXiv: Benchmarking Post-Training Quantization in LLMs (2025)
  • arXiv: Systematic Evaluation of On-Device LLMs: Quantization and Performance (2025)
  • arXiv: Sparse Self-Speculative Decoding for Reasoning Models (2025)
  • arXiv: MoBiLE: MoE Inference on Consumer Hardware (2025)
  • arXiv: Orion: Programming Apple's Neural Engine for LLM Training and Inference (2026)
  • arXiv: Benchmarking On-Device ML on Apple Silicon with MLX (2025)
  • arXiv: KVQuant: Towards 10M Context Length via KV Cache Quantization (2024)
  • arXiv: SparQ Attention: Bandwidth-Efficient LLM Inference (2023)

Benchmarks and Guides

  • Groundy: MLX vs llama.cpp Benchmarks
  • famstack: MLX vs GGUF Effective Throughput
  • Ollama Blog: MLX Backend
  • Ollama MLX 93% Speedup Walkthrough
  • llama.cpp Apple Silicon Benchmarks
  • SiliconBench: Community Apple Silicon Benchmarks

Tools and Frameworks

  • MLX Framework (Apple) — Apple's native ML framework for Apple Silicon
  • llama.cpp — CPU/GPU inference engine with Metal backend
  • Ollama — One-command local inference with MLX backend
  • vllm-mlx — vLLM serving on Apple Silicon via MLX
  • vllm-metal — Official vLLM Metal plugin
  • Unsloth — Fine-tuning tool with Dynamic 2.0 GGUF quantization
  • QMD — Local markdown search engine