
Apple Silicon LLM Inference Optimization: The Complete Guide to Maximum Performance

TL;DR: MLX is 20-87% faster than llama.cpp for generation on Apple Silicon (under 14B params). Use Ollama 0.19+ with the MLX backend for 93% faster decode with zero config. Q4_K_M is the sweet spot quantization (3.3% quality loss, 75% size reduction). On a 32GB Mac, top picks include Qwen 3.5 9B (daily driver), DeepSeek R1 Distill 14B (reasoning), Qwen 3.5 35B-A3B (MoE), and OpenAI gpt-oss-20b — but the "best" model depends on your use case, context length needs, and quality tolerance. Memory bandwidth is your bottleneck — not compute, not VRAM, not GPU cores. This guide covers every optimization that matters.

I run a 32GB M4 Mac Mini as my local inference box. After weeks of benchmarking different engines, quantization levels, models, and optimization techniques, I've compiled everything into one reference. The Apple Silicon inference ecosystem has matured dramatically in 2026 — MLX is no longer experimental, Ollama ships an MLX backend, and vLLM has two competing Apple Silicon ports. The performance gap between local and cloud is narrowing fast for mid-size models.

This guide is specifically about optimization — squeezing maximum tokens per second at maximum quality from Apple Silicon hardware. If you need a broader overview of inference tools and hardware, see my local LLM inference guide and Mac Mini buying guide.

The Bandwidth Bottleneck

LLM token generation is memory-bandwidth-bound, not compute-bound. Every generated token requires reading the entire model's weights from memory once. This means your tokens-per-second ceiling is a simple formula:

Max tok/s = Memory Bandwidth (GB/s) ÷ Model Size in Memory (GB)

Applied to M4-series chips:

| Chip | Bandwidth | 7B Q4 (~4GB) | 14B Q4 (~8GB) | 32B Q4 (~18GB) | 70B Q4 (~40GB) |
|---|---|---|---|---|---|
| M4 | 120 GB/s | ~30 tok/s | ~15 tok/s | ~6.7 tok/s | N/A |
| M4 Pro | 273 GB/s | ~68 tok/s | ~34 tok/s | ~15 tok/s | ~7 tok/s |
| M4 Max | 546 GB/s | ~136 tok/s | ~68 tok/s | ~30 tok/s | ~14 tok/s |

Real-world numbers hit 60-80% of theoretical due to KV cache reads, attention computation, and kernel overhead. But the relationship holds: quantization is a direct multiplier. Going from FP16 to Q4 gives you 4x throughput because you move 4x less data per token.
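To make the formula concrete, here's a rough estimator in Python (the 0.7 efficiency factor is an assumption reflecting the 60-80% real-world range above):

def est_decode_tok_s(bandwidth_gb_s, model_size_gb, efficiency=0.7):
    # Ceiling = bandwidth / bytes read per token, scaled by a real-world efficiency factor
    return efficiency * bandwidth_gb_s / model_size_gb

print(est_decode_tok_s(273, 8.0))   # M4 Pro + 14B Q4 (~8 GB): ~24 tok/s
print(est_decode_tok_s(546, 40.0))  # M4 Max + 70B Q4 (~40 GB): ~10 tok/s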

This is why buying more GPU cores barely helps. The M4 Pro's 16 GPU cores vs the base M4's 10 cores is a 60% increase in compute — but its 2.3x increase in memory bandwidth is what actually drives the 2x+ inference speedup.

MLX vs llama.cpp vs Ollama: Benchmarks

The Apple Silicon inference landscape shifted in 2026. MLX went from experimental to the fastest option for most workloads.

Generation Speed (decode tok/s)

M4 Max (128GB):

| Model | Quant | MLX (tok/s) | llama.cpp (tok/s) | MLX Advantage |
|---|---|---|---|---|
| Qwen3-0.6B | 4-bit | 525.5 | 281.5 | +87% |
| Llama-3.2-1B | 4-bit | 461.9 | 331.3 | +39% |
| Qwen3-4B | 4-bit | 159.0 | 118.2 | +35% |
| Qwen3-8B | 4-bit | 93.3 | 76.9 | +21% |
| Qwen2.5-27B | 4-bit | ~14 | ~14 | Tied |

Source: Groundy benchmarks

Key pattern: MLX leads by 20-87% for models under ~14B. The advantage collapses at 27B+ where memory bandwidth saturates and both frameworks hit the same ceiling.

Ollama 0.19 MLX Backend vs llama.cpp Backend

Ollama 0.19 added an MLX backend that auto-activates on Macs with 32GB+ RAM:

| Metric | llama.cpp backend (0.18) | MLX backend (0.19) | Improvement |
|---|---|---|---|
| Prefill | 1,147 tok/s | 1,804 tok/s | +57% |
| Decode | 57.8 tok/s | 111.4 tok/s | +93% |
| Total duration | 4.2s | 2.3s | -45% |

Model: Qwen3.5-35B-A3B on M4 Max 64GB. Source: DEV Community

To enable it explicitly, update to Ollama 0.19+ and set the environment variable:

export OLLAMA_MLX=1
ollama run qwen3:14b

The MLX backend requires 32GB+ unified memory; Macs with 16GB or 24GB will not activate it.
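To measure the difference on your own machine, Ollama's local REST API reports prefill and decode timing for every request. A minimal sketch (run it on 0.18, then again with OLLAMA_MLX=1 on 0.19+):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3:14b", "prompt": "Explain KV caching in two sentences.", "stream": False},
).json()

# eval_count/eval_duration cover decode, prompt_eval_* cover prefill; durations are in nanoseconds
decode_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")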

The Prefill Problem

MLX has a critical weakness: it performs full prefill before emitting any tokens. This means time-to-first-token (TTFT) rises linearly with input length.

Benchmark at 8.5K context (M1 Max, Qwen3.5-35B-A3B):

| Phase | MLX | GGUF (llama.cpp) |
|---|---|---|
| Prefill time | 49.4s | 37.8s |
| Prefill as % of total | 94% | 87% |
| Effective throughput | 3 tok/s | Faster overall |

Source: famstack.dev

The UI reported 51 tok/s for MLX generation, but effective wall-clock throughput was only 3 tok/s because 94% of time was spent prefilling.

When to Use Which

| Scenario | Winner | Why |
|---|---|---|
| Short input, long output (writing, code gen) | MLX | Higher sustained generation tok/s |
| Long input, short output (RAG, Q&A, classification) | llama.cpp (GGUF) | Faster prefill, lower TTFT |
| Easiest setup | Ollama 0.19+ | Auto-MLX on 32GB+, zero config |
| Model doesn't fit in memory | llama.cpp | CPU/GPU layer splitting (no MLX equivalent) |
| Multi-user serving | vllm-mlx | Continuous batching, 2-3.4x scaling |

Related: Local LLM Inference in 2026: The Complete Guide (a broader overview of 10 inference tools, hardware at every budget, and the open-weight model ecosystem).

Quantization Deep Dive

K-Quants: Why They Exist

K-quants (Q4_K_M, Q5_K_M, Q6_K) replaced legacy quantization (Q4_0, Q4_1) in llama.cpp. The difference is architectural:

| Feature | Legacy (Q4_0) | K-quants (Q4_K_M) |
|---|---|---|
| Block structure | Flat, single-level | Hierarchical super-blocks (256 values) |
| Scale quantization | One scale per block | Quantized scales + mins per sub-block |
| Bit allocation | Uniform across all layers | Variable precision by layer sensitivity |
| File size (7B) | 3.50 GB | 3.80 GB |
| Quality (ppl delta) | +0.2499 | +0.0535 |

K-quants deliver 3-4x less perplexity increase at the same file size. The llama.cpp maintainers now recommend Q3_K_M over Q4_0 — the 3-bit K-quant is better than the 4-bit legacy format.

The S/M/L suffixes control metadata overhead: S (smaller, faster), M (balanced — use this), L (more metadata, best reconstruction).

Quality Comparison by Quantization Level

Measured on Llama-3.1-8B-Instruct, WikiText-2 perplexity:

| Quant | Bits/Weight | Perplexity | Delta from FP16 | Quality Assessment |
|---|---|---|---|---|
| FP16 | 16.0 | 7.32 | 0% | Reference |
| Q8_0 | 8.5 | 7.33 | +0.14% | Essentially lossless |
| Q6_K | 6.6 | 7.35 | +0.41% | Virtually indistinguishable |
| Q5_K_M | 5.7 | 7.40 | +1.09% | Quality sweet spot |
| Q4_K_M | 4.9 | 7.56 | +3.28% | Best size/quality tradeoff |
| Q4_K_S | 4.6 | 7.62 | +4.10% | Noticeable on sensitive tasks |
| Q3_K_M | 3.9 | 7.96 | +8.74% | Meaningful degradation |
| Q3_K_S | 3.4 | 8.96 | +22.40% | Significant quality loss |

Source: arXiv 2601.14277

Practical guide:

  • Have the RAM? Use Q5_K_M — 1% quality loss is the best tradeoff
  • Tight on RAM? Use Q4_K_M — 3% loss is imperceptible for chat/coding
  • Want lossless? Q8_0 is 0.14% loss — essentially identical to FP16
  • Desperate for space? Q3_K_M is the floor — below this, quality falls off a cliff

GGUF vs MLX Format vs AWQ vs GPTQ

| Format | Apple Silicon Support | Speed on Mac | Quality | Ecosystem |
|---|---|---|---|---|
| GGUF | Full native (Metal) | Fastest (llama.cpp) | Excellent (K-quants) | Ollama, LM Studio, llama.cpp |
| MLX native | Full native | Fastest (MLX) | Excellent | mlx-lm, Ollama 0.19+ |
| AWQ | No (CUDA only) | N/A | Best accuracy retention | NVIDIA GPUs only |
| GPTQ | No (CUDA only) | N/A | Good | NVIDIA GPUs only |

GGUF is the only viable format on Apple Silicon for llama.cpp/Ollama. MLX has its own format (safetensors) that's 20-30% more memory-efficient due to zero-copy unified memory. MLX can read GGUF files, but only Q4_0, Q4_1, and Q8_0 — all other quants get cast to FP16, defeating the purpose.

Bottom line: Use GGUF for Ollama/llama.cpp. Use MLX-native models (from mlx-community on HuggingFace) for mlx-lm. Ignore AWQ and GPTQ entirely — they're NVIDIA-only.
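For the mlx-lm path, loading an mlx-community conversion takes only a few lines. A minimal sketch (the repo name is illustrative; any 4-bit mlx-community model works the same way):

# pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")  # illustrative repo name
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about unified memory."}],
    tokenize=False,
    add_generation_prompt=True,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)  # verbose prints tok/s stats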

RAM Requirements by Model Size

The formula: Model RAM (GB) = Parameters (in billions) × bits per weight ÷ 8. Keep model weights under 60% of total unified memory to leave room for macOS, KV cache, and the inference runtime.

| Parameters | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | FP16 |
|---|---|---|---|---|---|
| 1B | 0.6 GB | 0.7 GB | 0.8 GB | 1.1 GB | 2.0 GB |
| 3B | 1.8 GB | 2.1 GB | 2.5 GB | 3.2 GB | 6.0 GB |
| 7-8B | 4.6 GB | 5.3 GB | 6.1 GB | 8.0 GB | 15.0 GB |
| 13-14B | 8.0 GB | 9.3 GB | 10.7 GB | 13.8 GB | 26.3 GB |
| 22B | 12.5 GB | 14.6 GB | 16.9 GB | 21.7 GB | 41.2 GB |
| 27B | 15.4 GB | 17.9 GB | 20.7 GB | 26.7 GB | 50.6 GB |
| 32B | 18.2 GB | 21.2 GB | 24.5 GB | 31.6 GB | 60.0 GB |
| 70B | 39.9 GB | 46.4 GB | 53.6 GB | 69.1 GB | 131.3 GB |
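
A quick way to apply the formula and the 60% rule yourself (bits per weight come from the quantization table earlier):

def model_ram_gb(params_billions, bits_per_weight):
    # Model RAM (GB) = parameters (billions) x bits per weight / 8
    return params_billions * bits_per_weight / 8

def fits(total_ram_gb, params_billions, bits_per_weight, weight_budget=0.60):
    return model_ram_gb(params_billions, bits_per_weight) <= weight_budget * total_ram_gb

print(model_ram_gb(13, 4.9))  # ~8.0 GB for a 13B model at Q4_K_M
print(fits(32, 32, 5.7))      # False: 32B at Q5_K_M blows the 60% budget on a 32GB Mac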

What Fits at Each RAM Tier

| RAM | Max Model (Q4_K_M) | Max Model (Q8_0) | Sweet Spot |
|---|---|---|---|
| 8GB | 3B | 1B | Phi-3.5-mini 3B Q4 |
| 16GB | 13-14B | 7-8B | Llama 3.1 8B Q4 |
| 24GB | 22B | 13-14B | Qwen 3 14B Q5 |
| 32GB | 27B (tight) | 22B | Qwen 3 14B Q8 or 27B Q4 |
| 48GB | 70B (tight) | 32B | Qwen 3 32B Q5 or 70B Q4 |
| 64GB | 70B (comfortable) | 70B (tight) | Llama 3.3 70B Q4 |
| 128GB | 70B FP16 | 100B+ | 70B Q8 with massive context |

What Fits on 32GB

Usable memory budget: ~20GB for model weights (60% rule).

| Model | Quant | File Size | Est. tok/s (M4) | Context Headroom | Verdict |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 4.6 GB | 25-30 | Massive (64K+) | Fast, room for everything |
| Qwen 3.5 9B | Q5_K_M | 6.5 GB | 18-28 | Large (64K) | Strong daily driver |
| Qwen 3.5 9B | Q4_K_M | 5.5 GB | 22-32 | Massive (64K+) | Great all-rounder |
| DeepSeek R1 Distill 14B | Q4_K_M | 8.0 GB | 15-20 | Large | Best reasoning at this tier |
| OpenAI gpt-oss-20b | Q4_K_M | 11.5 GB | 10-18 | Good (32K) | Strong general purpose |
| Qwen 2.5 Coder 14B | Q4_K_M | 8.0 GB | 15-25 | Large | Best coding model |
| Gemma 3 12B | Q4_K_M | 7.0 GB | 20-30 | Massive (128K) | Long-context specialist |
| Gemma 4 26B-A4B (MoE) | Q4 | ~15 GB | 15-30 | Moderate (16K) | MoE with vision |
| Phi-4 14B | Q4_K_M | 8.0 GB | 15-25 | Moderate (16K) | Math/analytical reasoning |
| Qwen 3.5 27B | Q4_K_M | 15.4 GB | 6-10 | Limited (8K) | Tight fit, short context |
| Qwen 3.5 35B-A3B (MoE) | Q4 | ~19 GB | 15-30 | Limited (8-16K) | MoE with vision, top pick |
| Any 70B | Any | 40+ GB | | | Does not fit |

| Hugging Face repo | Downloads | Likes | Params | License |
|---|---|---|---|---|
| Qwen/Qwen3.5-9B | 5.1M | 1.2K | 9B | Apache 2.0 |
| Qwen/Qwen3.5-35B-A3B | 3.5M | 1.4K | 35B (3B active) | Apache 2.0 |
| openai/gpt-oss-20b | 5.9M | 4.5K | 20B | Apache 2.0 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 604.5K | 622 | 14B | MIT |

My recommended starting points for 32GB (but see Choosing a Model below):

  • Primary: Qwen 3.5 9B Q5_K_M — general use, coding, writing, multimodal
  • Reasoning: DeepSeek R1 Distill 14B Q4_K_M — chain-of-thought problems
  • Speed: Llama 3.1 8B Q4_K_M — quick tasks, agent chains
  • Maximum capability: Qwen 3.5 35B-A3B Q4 — MoE, 35B quality at near-8B speed
  • Alternative: OpenAI gpt-oss-20b Q4_K_M — strong general purpose, Apache 2.0

Related: Best Mac Mini for Running Local LLMs in 2026 (every Mac Mini config compared with current pricing, model compatibility tables, and used market analysis).

Choosing a Model: It Depends

There is no universally "best" model. The right choice depends on factors that vary by person and use case. Before defaulting to whatever is trending on HuggingFace, ask yourself:

What matters most to you?

| Priority | Best Options | Why |
|---|---|---|
| Raw quality per token | Qwen 3.5 27B Q4, gpt-oss-20b Q4 | Larger dense models = better reasoning |
| Speed + quality balance | Qwen 3.5 9B Q5_K_M | 9B is the sweet spot for 32GB |
| Chain-of-thought reasoning | DeepSeek R1 Distill 14B | Trained specifically for reasoning |
| Coding assistance | Qwen 2.5 Coder 14B | Fine-tuned on code, strong benchmarks |
| Long context (RAG, docs) | Gemma 3 12B (128K context) | Largest native context window |
| Vision + text (multimodal) | Qwen 3.5 9B, Gemma 4 26B-A4B | Both handle image input natively |
| Maximum capability on 32GB | Qwen 3.5 35B-A3B (MoE) | 35B knowledge, 3B compute cost |
| Open license required | gpt-oss-20b, Qwen 3.5 (Apache 2.0) | Fully permissive for commercial use |
| Fastest possible | Llama 3.1 8B Q4_K_M | Smallest footprint, highest tok/s |

The model landscape moves fast. The Qwen 3.5 family launched in February 2026 and already has models with 5M+ downloads. Gemma 4 arrived in March 2026 with a compelling MoE variant. OpenAI released their first open-weight model (gpt-oss-20b) in August 2025. By the time you read this, newer options may exist.

How to evaluate a new model for your setup:

  1. Check HuggingFace for downloads, community ratings, and GGUF availability
  2. Verify it fits your RAM budget using the RAM table above — model weights under 60% of total memory (the sketch after this list automates checks 1 and 2)
  3. Look for Unsloth Dynamic 2.0 GGUFs or mlx-community conversions for best Apple Silicon performance
  4. Test on YOUR use cases — benchmark scores do not always correlate with subjective quality for your specific tasks
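Here's a sketch of checks 1 and 2 using huggingface_hub (the repo ID is just an example, one of the Unsloth GGUF repos listed later; swap in whatever you're evaluating):

from huggingface_hub import HfApi

repo = "unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF"  # example repo; substitute the model you're evaluating
info = HfApi().model_info(repo, files_metadata=True)
print(f"{repo}: {info.downloads:,} downloads, {info.likes} likes")

budget_gb = 32 * 0.60  # 60% rule on a 32GB Mac
for f in sorted(info.siblings, key=lambda s: s.size or 0):
    if f.rfilename.endswith(".gguf"):
        size_gb = (f.size or 0) / 1e9
        print(f"{f.rfilename}: {size_gb:.1f} GB, fits: {size_gb <= budget_gb}")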

| Hugging Face repo | Downloads | Likes | Params | License |
|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | 1.3M | 580 | 26B (4B active) | Gemma |
| Qwen/Qwen2.5-Coder-14B-Instruct | 960.8K | 145 | 14B | Apache 2.0 |
| meta-llama/Llama-3.1-8B-Instruct | 9.2M | 5.7K | 8B | Llama 3.1 |

MoE vs Dense Models

The critical misconception: MoE does NOT save memory. All expert weights must reside in RAM — the router needs access to every expert to make routing decisions. MoE saves compute, not memory.

What MoE does give you on Apple Silicon:

  • Faster token generation — fewer active parameters per token means less computation
  • Higher quality per RAM dollar — a 30B MoE with 3B active params gives you 30B-class knowledge at near-3B speed
  • Same memory footprint — Mixtral 8x7B (46B total) needs ~26GB at Q4, same as a dense 46B model

MoE Models on 32GB Mac

| Model | Total Params | Active Params | Size (Q4) | Fits 32GB? | Est. tok/s | Vision? |
|---|---|---|---|---|---|---|
| Qwen 3.5 35B-A3B | 35B | 3.3B | ~19 GB | Yes (tight) | 15-30 | Yes |
| Gemma 4 26B-A4B | 26B | 4B | ~15 GB | Yes | 15-30 | Yes |
| Qwen 3 30B-A3B | 30.5B | 3.3B | ~17 GB | Yes (tight) | 15-30 | No |
| Mixtral 8x7B | 46B | 12.9B | ~26 GB | Marginal | 5-10 | No |
| Qwen 3.5 122B-A10B | 122B | 10B | ~70 GB | No | | Yes |
| DeepSeek V3 | 671B | 37B | ~380 GB | No | | No |

The MoE landscape has expanded significantly in early 2026. Two standouts for constrained hardware:

  • Qwen 3.5 35B-A3B (3.5M downloads, 1.3K likes) — successor to Qwen 3 30B-A3B with multimodal vision support. Only 3.3B active parameters means near-8B generation speed despite 35B total knowledge. On M4 Max, expect 64-92 tok/s.
  • Gemma 4 26B-A4B (1.3M downloads) — Google's MoE entry with 4B active params. Fits more comfortably in 32GB than Qwen 3.5 35B-A3B (~15GB vs ~19GB), leaving more room for KV cache and context.

The memory bandwidth formula explains why MoE helps: all experts must sit in RAM, but each generated token only reads the weights of the experts the router activates (plus the shared layers), so per-token memory traffic looks closer to a 3-4B model than a 35B one. The lighter compute per token helps too, but the reduced weight traffic is what lifts generation speed.

vLLM on Apple Silicon

Two competing projects bring vLLM's serving capabilities to Mac:

vllm-mlx

Built on MLX. The more feature-complete option.

pip install vllm-mlx
vllm-mlx serve mlx-community/Qwen3-14B-4bit
  • OpenAI + Anthropic Messages API compatibility
  • MCP tool calling support
  • Multimodal (text, vision, audio, embeddings)
  • Continuous batching — 2-3.4x throughput with concurrent requests
  • 21-87% faster than llama.cpp

Source: github.com/waybarrios/vllm-mlx
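Because it speaks the OpenAI API, any OpenAI client can point at it. A minimal sketch assuming the serve command above is listening on vLLM's usual port 8000 (adjust if yours differs):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local server ignores the key
resp = client.chat.completions.create(
    model="mlx-community/Qwen3-14B-4bit",
    messages=[{"role": "user", "content": "Summarize the memory bandwidth bottleneck in one sentence."}],
)
print(resp.choices[0].message.content)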

vllm-metal

Official vLLM community plugin with native Metal kernels.

curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
  • PagedAttention with Metal kernels
  • Full vLLM scheduler integration
  • Text-only (no multimodal)
  • No published benchmarks yet

Source: github.com/vllm-project/vllm-metal

Use vllm-mlx if you need multi-user serving from a Mac. Use Ollama for single-user inference.

Optimization Techniques

1. KV Cache Quantization

The KV cache stores attention states and grows linearly with context length. On a 14B model at 32K context, it can consume 4-8GB — a huge chunk of your 32GB budget.

llama.cpp supports KV cache quantization via flags:

llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0

| KV Cache Type | Memory Savings | Quality Impact | Recommendation |
|---|---|---|---|
| FP16 (default) | 1x | Baseline | Default |
| Q8_0 | 2x | Less than 1% degradation | Use this. Best tradeoff. |
| Q4_0 | 4x theoretical | Noticeable quality loss | Avoid — actually 92% slower at 64K context due to dequant overhead |

This is free performance. Q8 KV cache halves memory usage with negligible quality loss, letting you either run larger models or longer contexts.
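Where does the 4-8GB figure come from? KV cache size is layers x KV heads x head dim x context length x bytes per element, doubled for K and V. A quick sketch with illustrative dimensions for a ~14B GQA model (check the model's config.json for real values):

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # x2 for the separate K and V tensors; bytes_per_elem: 2 = FP16, 1 = Q8_0
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(kv_cache_gb(48, 8, 128, 32_768))                    # ~6.4 GB at FP16
print(kv_cache_gb(48, 8, 128, 32_768, bytes_per_elem=1))  # ~3.2 GB with a q8_0 cache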

2. Speculative Decoding

A small "draft" model generates candidate tokens; the large "target" model verifies them in a single forward pass. Output is mathematically identical — zero quality loss.

| Framework | Support | Speedup |
|---|---|---|
| MLX | Native (mlx-lm) | 2-3x |
| llama.cpp | -md / --model-draft flag | 2-3x |
| Ollama | Partial (via backends) | Varies |
| LM Studio | Beta | 2x |

Apple's own Recurrent Drafter research achieved 2.3x speedup on Metal. Typical acceptance rate is 60-80%.

# llama.cpp speculative decoding
llama-server -m qwen3-14b-q4.gguf -md qwen3-0.6b-q8.gguf

3. Flash Attention on Metal

Available through community implementations:

  • Metal FlashAttention 2.0 — 43-120% faster attention computation
  • mlx-mfa — Flash Attention for MLX
  • MLX built-in — Uses optimized Metal attention kernels (not branded Flash Attention, but similar benefits)

The main benefit: reducing attention memory from O(n^2) to O(n) in sequence length. Critical for long-context inference on memory-constrained machines.

4. Context Length vs Speed

Longer context = dramatically slower generation. At 40-50K tokens, expect 10x slower performance than short context.

Practical limits on 32GB (model weight budget ~20GB):

| Model (Q4) | Weights | Max Practical Context | tok/s (short context) | tok/s (long context) |
|---|---|---|---|---|
| Qwen 3 14B | 8 GB | 32K-64K | 15-25 | 5-10 |
| Qwen 3 30B-A3B | 17 GB | 8K-16K | 15-30 | 8-15 |
| Qwen 3 27B | 15 GB | 8K-16K | 6-10 | 3-5 |

With Q8 KV cache quantization, roughly double these context limits.


Best Models for Every Mac Configuration

These tables show strong options per tier — not the single "best" model. See Choosing a Model for how to decide.

M4 Base (16-32GB) — 120 GB/s

| Use Case | Model | Quant | RAM | tok/s |
|---|---|---|---|---|
| Daily driver | Qwen 3.5 9B | Q5_K_M | 6.5 GB | 18-28 |
| Reasoning | DeepSeek R1 Distill 14B | Q4_K_M | 8 GB | 15-20 |
| Coding | Qwen 2.5 Coder 14B | Q4_K_M | 8 GB | 15-25 |
| Speed | Llama 3.1 8B | Q4_K_M | 4.6 GB | 25-30 |
| Long context | Gemma 3 12B | Q4_K_M | 7 GB | 20-30 |
| General (20B) | OpenAI gpt-oss-20b | Q4_K_M | 11.5 GB | 10-18 |
| MoE | Gemma 4 26B-A4B | Q4 | ~15 GB | 15-30 |

M4 Pro (24-48GB) — 273 GB/s

| Use Case | Model | Quant | RAM | tok/s |
|---|---|---|---|---|
| Daily driver | Qwen 3.5 27B | Q4_K_M | 15 GB | 15-25 |
| Quality | Qwen 3.5 9B | Q8_0 | 10 GB | 30-45 |
| MoE | Qwen 3.5 35B-A3B | Q4 | 19 GB | 30-50 |
| Coding | Qwen 2.5 Coder 32B | Q4_K_M | 18 GB | 12-22 |
| General | OpenAI gpt-oss-20b | Q5_K_M | 14 GB | 15-25 |
| Max capability (48GB) | Llama 3.3 70B | Q4_K_M | 40 GB | 7-10 |

M4 Max (64-128GB) — 546 GB/s

| Use Case | Model | Quant | RAM | tok/s |
|---|---|---|---|---|
| Frontier local | Llama 3.3 70B | Q6_K | 54 GB | 10-15 |
| Quality 70B | Llama 3.3 70B | Q8_0 | 69 GB | 8-12 |
| Fast MoE | Qwen 3.5 35B-A3B | Q4 | 19 GB | 64-92 |
| Maximum quality | Qwen 3.5 27B | Q8_0 | 27 GB | 20-30 |
| Large MoE | Qwen 3.5 122B-A10B | Q4 | ~70 GB | 10-20 |

The M4 Chip Architecture

| Spec | M4 | M4 Pro | M4 Max |
|---|---|---|---|
| CPU Cores | 10 (4P + 6E) | 12-14 | 14-16 |
| GPU Cores | 8-10 | 16-20 | 32-40 |
| Neural Engine | 16-core, 38 TOPS | 16-core, 38 TOPS | 16-core, 38 TOPS |
| Max RAM | 16/24/32 GB | 24/48 GB | 36/48/64/128 GB |
| Memory Bandwidth | 120 GB/s | 273 GB/s | 546 GB/s |

Does the Neural Engine Help?

Not yet — but research is progressing. The 16-core Neural Engine (38 TOPS) is designed for vision, speech, and sensor data — not the large matrix operations in LLM inference. MLX does not use the ANE. Neither does llama.cpp. CoreML can theoretically route to ANE but imposes model size limits that make it impractical for 7B+ models.

However, the Orion project (March 2026) demonstrated the first open system for direct ANE programming, bypassing CoreML entirely via private APIs. It achieved 170+ tok/s for GPT-2 124M inference on M4 Max and documented 20 ANE hardware constraints (14 previously unknown). While not yet practical for large models, Orion shows the ANE has untapped potential for LLM workloads.

For now, the GPU is your inference engine. All production frameworks use Metal shaders on the GPU cores.

Note on M5: Apple's M5 chip introduced Neural Accelerators inside each GPU core specifically for matrix multiplication, providing up to 4x speedup for time-to-first-token. This is different from the standalone Neural Engine — it's integrated into the GPU pipeline. M5 memory bandwidth is 153 GB/s (28% higher than M4), yielding 19-27% faster token generation. MLX supports these accelerators natively.

Unsloth: Training Tool, Quantization Game-Changer

Unsloth is primarily a fine-tuning tool — it makes QLoRA training 2-5x faster with 50-70% less VRAM. It does NOT run inference. For inference, use Ollama/llama.cpp/MLX.

Apple Silicon training status: Not yet supported natively. Requires NVIDIA GPUs (CUDA). MLX training is listed as "coming soon."

Why Unsloth matters for you anyway: Dynamic 2.0 GGUFs.

Unsloth Dynamic 2.0 is an advanced quantization method that:

  • Analyzes each layer individually and picks the optimal quantization type per layer
  • Uses a calibration dataset of 1.5M+ tokens for conversational optimization
  • Reduces KL divergence by 5-8% compared to standard quantization at the same bit level
  • The resulting GGUFs run on any inference engine (llama.cpp, Ollama, LM Studio)

Download Unsloth Dynamic 2.0 GGUFs from HuggingFace for the best quality-at-size. You don't need Unsloth installed — just the model files. Top Unsloth GGUFs for Apple Silicon:

| Hugging Face repo | Downloads | Likes | Params | Quantization |
|---|---|---|---|---|
| unsloth/Qwen3.5-35B-A3B-GGUF | 1.4M | 794 | 35B MoE | Dynamic 2.0 |
| unsloth/Qwen3.5-9B-GGUF | 1.1M | 487 | 9B | Dynamic 2.0 |
| unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF | 92.4K | 133 | 14B | Dynamic 2.0 |
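
You can pull a single quant file straight from one of these repos with huggingface_hub (a sketch; the filename is illustrative, so list the repo's files to get the exact Q4_K_M name):

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF",
    filename="DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf",  # illustrative filename
)
print(path)  # point llama-server -m (or an Ollama Modelfile FROM line) at this path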

Related: Obsidian + Claude Code: The Complete Integration Guide (connect your local inference setup to Obsidian for an AI-powered knowledge base workflow).

Sources

Research Papers

  • arXiv: Native LLM Inference at Scale on Apple Silicon via vllm-mlx (2026)
  • arXiv: Comparative Study of MLX, Ollama, llama.cpp on Apple Silicon (2025)
  • arXiv: Profiling LLM Inference on Apple Silicon: A Quantization Perspective (2025)
  • arXiv: Which Quantization Should I Use? Unified llama.cpp Evaluation (2026)
  • arXiv: Benchmarking Post-Training Quantization in LLMs (2025)
  • arXiv: Systematic Evaluation of On-Device LLMs: Quantization and Performance (2025)
  • arXiv: Sparse Self-Speculative Decoding for Reasoning Models (2025)
  • arXiv: MoBiLE: MoE Inference on Consumer Hardware (2025)
  • arXiv: Orion: Programming Apple's Neural Engine for LLM Training and Inference (2026)
  • arXiv: Benchmarking On-Device ML on Apple Silicon with MLX (2025)
  • arXiv: KVQuant: Towards 10M Context Length via KV Cache Quantization (2024)
  • arXiv: SparQ Attention: Bandwidth-Efficient LLM Inference (2023)

Benchmarks and Guides

  • Groundy: MLX vs llama.cpp Benchmarks
  • famstack: MLX vs GGUF Effective Throughput
  • Ollama Blog: MLX Backend
  • Ollama MLX 93% Speedup Walkthrough
  • llama.cpp Apple Silicon Benchmarks
  • SiliconBench: Community Apple Silicon Benchmarks

Tools and Frameworks

  • MLX Framework (Apple) — Apple's native ML framework for Apple Silicon
  • llama.cpp — CPU/GPU inference engine with Metal backend
  • Ollama — One-command local inference with MLX backend
  • vllm-mlx — vLLM serving on Apple Silicon via MLX
  • vllm-metal — Official vLLM Metal plugin
  • Unsloth — Fine-tuning tool with Dynamic 2.0 GGUF quantization
  • QMD — Local markdown search engine