Local LLM Inference in 2026: The Complete Guide to Tools, Hardware & Open-Weight Models
TL;DR: Ollama is the fastest path to running local LLMs (one command to install, one to run). The Mac Mini M4 Pro 48GB ($1,599) is the best-value hardware. Q4_K_M is the sweet spot quantization for most users. Open-weight models like GLM-5, MiniMax M2, and Hermes 4 are impressively capable for a wide range of tasks. This guide covers 10 inference tools, every quantization format, hardware at every budget, and the builders making all of this possible.
I've been setting up local inference on my own hardware recently — an M4 Pro Mac Mini running Ollama — and I wanted to compile everything I've learned into one place. This guide is as much for my own reference as it is for anyone else exploring this space.
The tooling in 2026 has matured to the point where a $1,600 setup can handle 70B-parameter models. Whether you want to reduce API costs for simple tasks, keep sensitive data private, build offline-capable apps, or just understand how these models actually work, there are real options now.
I still use Claude Code as my primary coding tool — local models aren't a replacement for frontier cloud inference on complex tasks. But they're genuinely useful for a lot of workflows, and the ecosystem is worth understanding. This guide covers the tools, formats, hardware, and people building the open-source ecosystem.
Table of Contents
- Tool Comparison Matrix
- Ollama — The Developer Default
- LM Studio — The Visual Explorer
- vLLM — Production GPU Serving
- llama.cpp — The Foundation
- ExoLabs — Distributed Inference
- Other Notable Tools
- Quantization Formats and Tradeoffs
- Choosing the Right Tool
- Hardware Buying Guide
- Thought Leaders and Builder Strategies
- Key Themes
Tool Comparison Matrix
Ten tools, compared across what matters. Stars reflect community adoption as of March 2026.
| Tool | Stars | Platforms | Model Formats | GPU Required? | API Compatibility | Best For |
|---|---|---|---|---|---|---|
| Ollama | 166k | Mac/Win/Linux | GGUF | No | OpenAI + Anthropic | Developer workflows |
| llama.cpp | 98.6k | All + mobile | GGUF | No | OpenAI | Foundation / power users |
| Exo | 42.7k | Mac/Linux/mobile | MLX / tinygrad | No | Varies | Distributed inference |
| Jan.ai | 41.1k | Mac/Win/Linux | GGUF, MLX | No | OpenAI | Privacy-first desktop |
| LocalAI | 35-42k | Linux/Mac/Win | Multi-format | No | OpenAI + Anthropic | Drop-in API replacement |
| vLLM | 31k+ | Linux | safetensors, AWQ, GPTQ | Yes | OpenAI | Production GPU serving |
| MLX | 24.6k | macOS only | safetensors | No (Apple Silicon) | Third-party | Mac-native development |
| LM Studio | N/A (closed) | Mac/Win/Linux | GGUF / MLX | No | OpenAI | Visual model exploration |
| KoboldCpp | 9.5k | All + Android | GGUF | No | Triple (OAI + Ollama + Kobold) | Creative writing |
| GPT4All | N/A | Mac/Win/Linux | GGUF | No | OpenAI | Private document chat |
Every tool above except LM Studio is open-source. Most build on top of llama.cpp — the foundational C/C++ inference engine that pioneered running LLMs on consumer hardware.
Ollama — The Developer Default
Ollama is the fastest path from zero to running local models. One command to install, one to run, and you get an OpenAI-compatible API on localhost:11434. It's open-source (MIT), written in Go, and has 166k GitHub stars — the largest open-source AI project on GitHub by a wide margin.
```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run llama3
```

That's it. No Python environments, no CUDA toolkit, no configuration files.
Why developers default to Ollama
- OpenAI + Anthropic API compatibility — Claude Code and OpenAI Codex CLI can use Ollama as a local backend. Your existing API client code works with minimal changes.
- Largest model registry — 100+ models available with ollama pull. One-command downloads.
- Performance — M3 Pro generates 40-60 tok/s on 7B models. Benefits from all llama.cpp optimizations (up to 35% faster from CES 2026 NVIDIA improvements).
- Image generation — Added to macOS in January 2026.
- Web search + structured outputs — Both added in 2026.
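Because Ollama exposes an OpenAI-compatible endpoint, talking to it needs nothing beyond the standard library. A minimal sketch, assuming Ollama is running on the default localhost:11434 with llama3 pulled:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat endpoint (default port)
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("llama3", "Say hello in five words.")
# urllib.request.urlopen(req) would return an OpenAI-format JSON response
# once `ollama serve` (or `ollama run llama3`) is running locally.
```

The same request shape works against LM Studio (localhost:1234) or vLLM by swapping the URL — which is exactly why the OpenAI-compatible API became the lingua franca of this ecosystem.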
Where Ollama falls short
- GGUF-only for native format — safetensors/PyTorch models require a conversion step via Modelfile
- No GUI — third-party frontends like Open WebUI fill this gap
- Slightly higher overhead than raw llama.cpp (the abstraction layer costs a few percent)
- Custom model importing requires creating a Modelfile rather than just pointing at a file
For most developers, Ollama is the right first tool. Start here, then graduate to other tools as your needs become more specific.
LM Studio — The Visual Explorer
LM Studio is the most beginner-friendly option — a desktop application where you browse models, click to download, and start chatting. Zero terminal knowledge required. Closed-source but free for personal use.
What makes it stand out:
- Built-in model browser with one-click downloads from Hugging Face
- MLX backend on Apple Silicon for optimized Mac inference
- Split-view chat for side-by-side model comparison
- v0.4.0 (January 2026) added parallel inference with continuous batching
- New headless "llmster" daemon enables server-only deployment on Linux boxes without the GUI
Formats: GGUF (llama.cpp backend), MLX (Apple Silicon only), safetensors. No EXL2 or GPTQ support.
API: OpenAI-compatible on localhost:1234. Python and TypeScript SDKs hit v1.0.0.
LM Studio is ideal for model evaluation — browse, download, compare side-by-side — before deploying with Ollama or vLLM in production.
vLLM — Production GPU Serving
If you're deploying models on GPU infrastructure at scale, vLLM is the industry standard. It's the performance leader with PagedAttention for memory-efficient KV cache management, continuous batching, and speculative decoding.
Benchmarks with Marlin kernels: AWQ achieves 741 tok/s, GPTQ achieves 712 tok/s. vLLM v0.16.0 (February 2026) expanded multi-GPU and multi-platform support to NVIDIA, AMD ROCm, Intel XPU, and TPU.
Formats: The widest range — safetensors, GPTQ, AWQ, FP8, NVFP4, bitsandbytes. This matters because GPU-optimized quantization formats like AWQ achieve better throughput than GGUF on NVIDIA hardware.
The catch: Linux-only for production, requires a dedicated NVIDIA/AMD GPU, complex setup compared to Ollama. Overkill for single-user local inference.
Use vLLM when: You're serving multiple users, need maximum throughput on GPU hardware, or are deploying in production. The common developer workflow is: evaluate models with LM Studio, develop with Ollama, deploy with vLLM.
llama.cpp — The Foundation
llama.cpp is the C/C++ inference engine that everything else builds on. Created by Georgi Gerganov, it pioneered running LLMs on consumer hardware via quantization. In February 2026, the ggml/llama.cpp team joined Hugging Face.
Ollama, LM Studio, GPT4All, and KoboldCpp all use llama.cpp under the hood. It's the engine — they're the interfaces.
Why use it directly?
- Maximum control over inference parameters and model loading
- Widest platform support: macOS, Windows, Linux, Android, iOS, WebAssembly
- Best CPU inference performance — designed from the ground up for consumer hardware
- Defines and maintains the GGUF format standard
Stats: 98.6k GitHub stars, 1,038 contributors, 28 upstream commits per week. CES 2026 NVIDIA optimizations yielded up to 35% faster token generation.
Use llama.cpp directly when you need fine-grained control that Ollama or LM Studio don't expose. Otherwise, use the higher-level tools — they give you 95% of the performance with much less configuration.
ExoLabs — Distributed Inference
Exo takes a fundamentally different approach: instead of running a model on one device, it splits the model across multiple devices connected peer-to-peer. No master-worker architecture — any device can contribute compute.
What's been demonstrated:
- DeepSeek V3 (671B parameters) across 8 M4 Pro 64GB Mac Minis (512GB total memory) at ~5 tok/s
- DeepSeek R1 (671B) across 7 Mac Minis + 1 M4 Max MacBook Pro (496GB total)
- 2 NVIDIA DGX Spark + M3 Ultra Mac Studio = 2.8x benchmark improvement through disaggregated inference
Why this works with Apple Silicon: Unified memory is ideal for Mixture-of-Experts (MoE) models. All 671B parameters load across the cluster, but only 37B are computed per inference step. Apple devices become surprisingly cost-effective for MoE architectures.
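The MoE advantage falls out of simple arithmetic: during decoding, each token only needs to stream the active experts' weights through memory, not all 671B parameters. A back-of-envelope ceiling, ignoring interconnect latency and KV-cache traffic (the 4.5 bits/weight figure is my assumption for a Q4-class quant):

```python
def moe_decode_ceiling(active_params_b: float, bits_per_weight: float, bandwidth_gbs: float) -> float:
    """Rough tok/s upper bound: each decoded token must stream the
    active expert weights from memory at least once."""
    active_bytes_gb = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gbs / active_bytes_gb

# DeepSeek V3: 671B total but only ~37B active per token,
# on a single M4 Pro node's 273 GB/s of unified memory bandwidth:
print(round(moe_decode_ceiling(37, 4.5, 273), 1))  # ~13 tok/s per-node ceiling
```

The ~5 tok/s Exo demonstrated across 8 Mac Minis sits below this per-node ceiling once network hops are added — but the point stands: for a dense 671B model the same math would give well under 1 tok/s, which is why MoE is what makes these clusters practical.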
Current status: Alpha (v0.0.15-alpha public, 1.0 not yet released). macOS native app requires Tahoe 26.2+.
If you have multiple Macs, Exo lets you pool them into a single inference cluster. The constraint is total unified memory across devices — and the network connecting them.
For a deep dive on which Mac Mini to buy for local inference (with current Amazon pricing and used market analysis), see my complete Mac Mini buying guide for local LLMs.
Other Notable Tools
Jan.ai
Open-source (AGPLv3) privacy-first desktop app. 41.1k stars, 5.3M+ downloads. Runs 100% offline via the Cortex engine (wraps llama.cpp). The standout feature is hybrid local + cloud switching — you can connect OpenAI, Anthropic, and local models in one interface, switching between them as needed. MCP integration for agentic workflows. Supports Windows ARM (Snapdragon).
LocalAI
The most comprehensive API-compatible local server. Drop-in replacement for OpenAI's API that supports text, images, audio, video, embeddings, and voice cloning — all locally. Multi-backend support (llama.cpp, vLLM, transformers, diffusers, MLX). Anthropic API support added January 2026. Best for: developers with existing OpenAI API code who want to run locally with minimal changes.
KoboldCpp
Single-executable fork of llama.cpp with an integrated web UI. "One file, zero install" — download, double-click, select a model. Triple API compatibility (KoboldAI + OpenAI + Ollama endpoints). The best tool for creative writing and roleplay with built-in memory, world info, author's notes, and SillyTavern integration.
GPT4All
Desktop app by Nomic AI with built-in LocalDocs for private document chat (RAG). The 2026 GPT4All Reasoner adds on-device reasoning with tool calling and code sandboxing. Backed by a funded company (Nomic AI). Best for non-technical users who want to chat with their documents privately.
MLX
Apple's open-source ML framework purpose-built for Apple Silicon. Not a user-facing app — a framework that other tools use as a backend. Leverages unified memory with zero CPU-GPU data copying. Built-in mixed-precision quantization (4/6/8-bit per layer). M5 Neural Accelerators provide up to 4x speedup for time-to-first-token. Swift API for native macOS/iOS apps.
Quantization Formats and Tradeoffs
Quantization compresses model weights from 16 bits per weight (FP16/BF16) down to fewer bits. This is what makes it possible to run a 70B parameter model on consumer hardware.
GGUF: The Universal Format
GGUF was created by llama.cpp and is used by Ollama, LM Studio, KoboldCpp, GPT4All, and Jan.ai. The "K-quant" variants use mixed precision per layer, allocating more bits to important layers.
| Quant | Bits/Weight | Size (7B model) | Quality Retention | Best For |
|---|---|---|---|---|
| Q8_0 | 8-bit | ~7.5 GB | ~99% (near-lossless) | Maximum quality, enough RAM |
| Q6_K | 6-bit | ~5.5 GB | ~97% | Quality-focused with moderate RAM |
| Q5_K_M | 5-bit | ~4.8 GB | ~95% | Good balance |
| Q4_K_M | 4-bit | ~4.0 GB | ~92% (sweet spot) | Most users |
| Q3_K_M | 3-bit | ~3.2 GB | ~85% | Tight memory constraints |
| Q2_K | 2-bit | ~2.5 GB | ~75% | Extreme compression |
The practical ladder: Q4_K_M → Q5_K_M → Q6_K → Q8_0 as you get more memory. For most users, Q4_K_M is the sweet spot — 92% quality retention with 75% size reduction from FP16.
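You can sanity-check the table above with a one-line estimate: file size is roughly parameters × bits per weight ÷ 8, plus a small overhead for embeddings, metadata, and the higher-precision layers K-quants keep. The ~4.5 effective bits/weight for Q4_K_M and the 5% overhead factor are my assumptions:

```python
def gguf_size_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.05) -> float:
    """Approximate on-disk size of a quantized model: params * bits / 8,
    scaled by a small factor for metadata and mixed-precision layers."""
    return params_billions * bits_per_weight / 8 * overhead

print(round(gguf_size_gb(7, 4.5), 1))   # close to the ~4.0 GB the table lists for a 7B at Q4
print(round(gguf_size_gb(70, 4.5), 1))  # ~41 GB -- why a 70B at Q4 wants ~48 GB of RAM
```

Remember to leave headroom beyond the file size itself: the KV cache and OS overhead both eat into the same memory pool.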
GPU-Optimized Formats
These formats are designed for NVIDIA GPUs and used by vLLM, ExLlamaV2, and transformers:
| Format | Bits | Quality | Speed (Marlin) | Used By |
|---|---|---|---|---|
| AWQ | 4-bit | ~95% | 741 tok/s | vLLM, transformers |
| GPTQ | 4-bit | ~90% | 712 tok/s | vLLM, ExLlamaV2 |
| EXL2 | 2-8 mixed | Variable | Fastest (single-user) | ExLlamaV2 / TabbyAPI |
| FP8 | 8-bit | ~99% | Very fast | vLLM, llama.cpp |
| NVFP4 | 4-bit | ~92% | Fastest (Blackwell) | llama.cpp, vLLM |
AWQ vs GPTQ: AWQ consistently outperforms GPTQ in both quality (~95% vs ~90%) and speed. AWQ uses activation statistics to identify the most important weights and protect them during quantization. For most GPU users, AWQ is the better choice.
GGUF vs AWQ/GPTQ: GGUF is universal — runs on CPU, GPU, and Apple Silicon. AWQ/GPTQ are GPU-only but provide better throughput on NVIDIA hardware. Use GGUF for flexibility, AWQ for maximum GPU throughput.
Choosing the Right Tool
By Use Case
| Scenario | Tool | Why |
|---|---|---|
| First time, just want to try | LM Studio | Visual GUI, one-click downloads |
| Developer, quick local testing | Ollama | One command, OpenAI-compatible API |
| Creative writing / roleplay | KoboldCpp | Built-in storytelling features |
| Private document chat | GPT4All | LocalDocs RAG built-in |
| Privacy-first desktop app | Jan.ai | Full offline, hybrid local/cloud |
| Production GPU serving | vLLM | Highest throughput, multi-GPU |
| Drop-in OpenAI replacement | LocalAI | Most complete API compatibility |
| Mac-native app development | MLX | Swift API, best Apple Silicon perf |
| Models too large for one device | Exo | Distributed inference |
| Maximum control | llama.cpp | The foundation |
By Skill Level
| Level | Recommended Tools |
|---|---|
| Beginner (no terminal) | LM Studio, GPT4All, Jan.ai |
| Intermediate (CLI) | Ollama, KoboldCpp |
| Advanced (Python/systems) | llama.cpp, MLX, LocalAI, vLLM |
| Expert (distributed) | Exo, vLLM multi-GPU |
The Common Multi-Tool Workflow
Many developers in 2026 use a three-tool pipeline:
- LM Studio for model discovery and evaluation (browse, download, compare side-by-side)
- Ollama for development and integration (OpenAI-compatible API for app development)
- vLLM for production deployment (maximum throughput on GPU infrastructure)
Hardware Buying Guide
The Fundamental Rule
For LLM inference, memory bandwidth is the bottleneck, not compute. A chip with higher GB/s generates tokens faster, even if it has fewer FLOPS. This is why an M3 Max (400 GB/s) generates tokens faster than an M4 Pro (273 GB/s) despite the M4 Pro being newer.
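This rule gives you a quick way to estimate token generation speed before buying anything: during decode, every generated token streams the full weight set through memory once, so tok/s is capped near bandwidth ÷ model size. A sketch using the two chips mentioned above and a ~4 GB 7B Q4 model:

```python
def decode_ceiling_toks(model_size_gb: float, bandwidth_gbs: float) -> float:
    """Decode is memory-bound: tok/s is capped near bandwidth / model size."""
    return bandwidth_gbs / model_size_gb

# Same ~4 GB 7B Q4 model on the two chips from the text:
print(round(decode_ceiling_toks(4.0, 400), 1))  # M3 Max (400 GB/s): ~100 tok/s ceiling
print(round(decode_ceiling_toks(4.0, 273), 1))  # M4 Pro (273 GB/s): ~68 tok/s ceiling
```

Real-world throughput lands below these ceilings (KV-cache reads and scheduling overhead take their cut), but the ratio between two machines tracks the ratio of their bandwidths — which is exactly why the newer M4 Pro loses to the older M3 Max here.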
Memory Requirements by Model Size
| Model Size | Min RAM (Q4) | Comfortable (Q6-Q8) | Example Models |
|---|---|---|---|
| 3B | 4 GB | 6 GB | Phi-4-mini |
| 7-8B | 6 GB | 10 GB | Llama 3.1 8B, Mistral 7B |
| 13-14B | 10 GB | 16 GB | Llama 3.1 13B, Qwen 14B |
| 30-34B | 20 GB | 32 GB | Codestral 22B |
| 70B | 40 GB | 64 GB | Llama 3.1 70B, Qwen 72B |
| 100B+ | 64 GB | 128 GB+ | Llama 3.1 405B (quantized) |
Apple Silicon
Macs are uniquely suited for local LLMs because of unified memory — the GPU can access all system RAM, unlike discrete GPUs with fixed VRAM. RAM is not upgradeable on Apple Silicon. Buy the most you can afford.
| Machine | Memory | Bandwidth | Price | Best For |
|---|---|---|---|---|
| Mac Mini M4 | 16-24 GB | 120 GB/s | $599-799 | 7-14B, experimentation |
| Mac Mini M4 Pro | 24-48 GB | 273 GB/s | $1,399-1,599 | Sweet spot. 70B at Q4 with 48GB |
| MacBook Pro M4 Pro | 24-48 GB | 273 GB/s | $1,999-2,499 | Portable 70B inference |
| MacBook Pro M4 Max | 48-128 GB | 546 GB/s | $3,499-4,999 | Fast 70B, moderate 100B+ |
| Mac Studio M4 Ultra | 128-512 GB | 819 GB/s | $3,999-11,999 | Run anything locally |
| MacBook Pro M5 Max | 48-128 GB | TBD | $3,499+ | Neural Accelerators, 4x TTFT |
Best value: Mac Mini M4 Pro 48GB ($1,599) — runs 70B parameter models and costs less than a good GPU.
For a complete pricing breakdown of every Mac Mini configuration (new and used), with model compatibility tables and OpenClaw setup instructions, see my Mac Mini buying guide for local LLMs.
NVIDIA GPUs
VRAM is the limiting factor — models must fit in GPU VRAM or spill to CPU RAM at a significant speed penalty.
| GPU | VRAM | Bandwidth | Price (2026) | Best For |
|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | 360 GB/s | $250-300 (used) | Budget entry, 7B |
| RTX 3090 24GB | 24 GB | 936 GB/s | $800-1,000 (used) | Best budget for 13B |
| RTX 4090 24GB | 24 GB | 1,008 GB/s | $1,600-2,200 | Balance. 13B full, 70B quantized |
| RTX 5090 32GB | 32 GB | 1,792 GB/s | $2,500-3,600+ | Flagship. 2.6x faster than A100 on 7B |
| RTX 3090 x2 | 48 GB | 1,872 GB/s | $1,600-2,000 | Budget 70B on Linux with vLLM |
Budget Tiers
| Budget | Recommendation | What You Can Run |
|---|---|---|
| $0 | Your existing machine + Ollama | 3-7B on most modern hardware |
| $375 | Used M1 Mac 16GB | 7B models at decent speed |
| $599 | Mac Mini M4 24GB | 7-14B comfortably |
| $900 | Used RTX 3090 (add to PC) | 7-13B at GPU speed |
| $1,599 | Mac Mini M4 Pro 48GB | 70B models — best value in the market |
| $2,000 | Used RTX 4090 (add to PC) | 13B fast, 70B quantized |
| $3,500+ | RTX 5090 or MBP M4/M5 Max | 70B fast, frontier performance |
| $8,000+ | Mac Studio M4 Ultra 192GB | Run anything |
For building dedicated GPU inference servers at any budget (from $150 starter builds to $5,000+ rigs), Digital Spaceport has the most comprehensive build guides I've found.
Thought Leaders and Builder Strategies
These are the builders, researchers, and educators I've been learning from as I explore local inference. Whether they're building tools, training models, or documenting hardware builds, they're all making this ecosystem more accessible.
This list was inspired by 0xSero's thread on people to follow in the local inference space. 0xSero is one of the most active voices in the open-source AI community, and his recommendations pointed me to many of the builders profiled below.
0xSero (@0xSero)
One of the most active builders in the local inference community. Publishes quantized models on Hugging Face using Intel AutoRound, making large models runnable on consumer hardware. Built vllm Studio for managing local models with chat template proxies that make Hermes, MiniMax, and GLM models compatible with OpenAI and Anthropic API formats. Also created ai-data-extraction for extracting chat and code context data from AI coding assistants for ML training, and fine-tuned models like sero-nouscoder-14b-sft trained on real coding conversations.
Andrej Karpathy (@karpathy)
The best teacher in AI. nanochat is the definitive entry point for understanding LLM training — a full-stack pipeline in ~8,300 lines of clean PyTorch covering tokenization, pretraining, SFT, and reinforcement learning. It trains a 561M-parameter ChatGPT clone in ~4 hours (~$15 on spot instances).
What makes nanochat uniquely effective for learning: one dial — transformer depth. This single integer auto-determines all other hyperparameters, so you can understand the full pipeline without needing hyperparameter tuning expertise.
His latest project, autoresearch, uses AI agents to autonomously optimize nanochat training configurations — AI improving AI training.
Peter Steinberger (@steipete)
His GitHub is a treasure trove. Peekaboo (macOS screenshot automation for AI agents), Summarize (CLI that extracts/summarizes any URL, YouTube, PDF, or audio), and OpenClaw (the fastest-growing GitHub project at 180k+ stars — an autonomous AI assistant that lives on your computer and self-modifies its own code).
His design principle: "CLIs are the universal interface that both humans and AI agents can actually use effectively." Build CLI-first — it becomes the universal adapter between human workflows and agent automation.
Mario Zechner (@badlogicgames)
Pi is possibly the best, simplest open-source agentic loop to learn from. The pi-mono agent toolkit achieves power through radical minimalism: exactly 4 tools, a system prompt under 1,000 tokens, and a philosophy that "what you leave out matters more than what you put in." Pi became the engine behind OpenClaw.
His anti-MCP argument is worth considering: popular MCP servers like Playwright MCP (21 tools, 13.7k tokens) consume 7-9% of context window before work begins. Pi's alternative: CLI tools with README files — agents read the README only when needed, paying token cost only when necessary.
Takeaway: Start with 4 tools, not 40. Context engineering matters more than tool count.
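The MCP overhead claim is easy to verify with arithmetic. Using the 13.7k-token figure from the text against a 200k-token context window (my assumption for the window size — smaller windows push the percentage toward the top of the 7-9% range):

```python
def context_overhead_pct(tool_tokens: int, context_window: int) -> float:
    """Share of the context window consumed by tool definitions
    before the agent does any actual work."""
    return 100 * tool_tokens / context_window

# Playwright MCP: 21 tools, 13.7k tokens of definitions, 200k window (assumed):
print(round(context_overhead_pct(13_700, 200_000), 1))  # just under 7%
```

Pi's README-on-demand approach makes that cost lazy instead of eager: the agent pays the tokens only for the tools it actually reads about.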
Ahmad Osman (@TheAhmadOsman)
The GPU king. Moderator of r/LocalLLaMA, deep practical knowledge across NVIDIA, Mac, and Tenstorrent hardware. Hosts GPU giveaways with NVIDIA (RTX PRO 6000 Blackwell for GTC 2026) and regularly interviews open-weight labs. His key blog post — Stop Wasting Your Multi-GPU Setup With llama.cpp: Use vLLM or ExLlamaV2 for Tensor Parallelism — is essential reading for anyone with multiple GPUs.
@sudoingX
Pushing the limits of single-GPU inference. Ran Qwopus (Claude Opus 4.6 reasoning distilled into Qwen 3.5 27B) on a single RTX 3090 at 29-35 tok/s with thinking mode. Ran Qwen 3.5 9B on a single RTX 3060 — "5.3 GB of model on a card most people bought to play Warzone." Also discovered and published the fix for the Qwen 3.5 jinja template crash that broke OpenCode and Claude Code.
Takeaway: A single RTX 3090 can run 27B coding models at usable speeds — impressive for tasks like code completion and simpler agentic workflows.
Alex Cheema (@alexocheema)
Founder of ExoLabs. Oxford physics graduate. Pioneering distributed inference across Apple hardware — demonstrated 671B parameter models running across Mac Mini clusters. The Exo framework (42.7k stars) uses peer-to-peer topology with automatic device discovery and dynamic model partitioning. If you're interested in Mac Mini and Mac Studio clustering, this is the person to follow.
Digital Spaceport (@gospaceport)
The homelab hardware teacher. End-to-end AI server builds at every budget — from $150 starter builds to $5,000 quad-3090 rigs. His Proxmox guides for Ollama + Open WebUI and vLLM are the best I've found.
Numman Ali (@nummanali)
Prolific CLI tool builder. cc-mirror creates isolated Claude Code variants with custom providers — your main installation stays untouched. Supports Z.ai, MiniMax, OpenRouter, Ollama, and local LLMs. Quick start: npx cc-mirror quick --provider mirror --name mclaude. Also building OpenSkills (cross-agent skill sharing) and an agent-native SDLC pipeline.
Takeaway: You don't need an Anthropic subscription to use Claude Code's interface. cc-mirror lets you point it at local or alternative models.
Dax Raad (@thdxr)
Creator of OpenCode — an open-source terminal-first AI coding agent with 120k+ stars, 75+ LLM providers, and zero data storage. Also built SST and models.dev. His grounded take: "The productivity feeling is real. The productivity isn't." OpenCode is vendor lock-in free — use any model provider.
Julia Turc (@juliarturc)
The compression scientist. Her paper Well-Read Students Learn Better (706+ citations) proved that pre-training compact models before distillation yields compound improvements — foundational research for how modern quantized models work. Now building Storia.ai (YC S24). Her YouTube channel explains deep AI concepts without the hype.
Teknium (@Teknium1)
Head of Post-Training at Nous Research ($1B valuation). Co-creator of the Hermes 4 model family (open-weight, hybrid reasoning, up to 405B parameters). Built DataForge for graph-based synthetic data generation. The OpenHermes 2.5 dataset (1M samples) is openly available. Also drove decentralized training via INTELLECT-2 — a 32B model trained across 100+ GPUs on 3 continents.
Open-Weight Model Labs
Several people are driving the open-weight model ecosystem forward:
- Victor Mustar (@victormustar) — Head of Product at Hugging Face, shaping the UX of the platform hosting the world's largest open model collection.
- Z.ai Community (@louszbd) — GLM-5 is 744B parameters (40B active), MIT licensed, #1 among open models on Text Arena with day-0 vLLM/SGLang support.
- Skyler Miao (@SkylerMiao7) — Head of Engineering at MiniMax. M2 is 230B total / 10B active, MoE architecture that scores well on benchmarks while being very cost-efficient to run. API pricing: $0.30/M input tokens.
Also Worth Following
- @Ex0byt — Making local inference on massive models possible on consumer hardware
- @alexinexxx — GPU kernel programming learner with strong drive and educational content
- @crystalsssup — Building top open-weight models and releasing research openly
Key Themes
1. The barrier to entry keeps dropping. Karpathy's nanochat trains a ChatGPT clone in an afternoon of cloud compute, and you can start running models for $0 with Ollama on your existing machine.
2. Consumer GPUs are more capable than you'd expect. @sudoingX runs 27B coding models on a single RTX 3090 at usable speeds. Digital Spaceport documents builds starting at $150.
3. Apple Silicon clustering is an interesting frontier. Exo Labs runs 671B parameter models across Mac Mini clusters. Unified memory + MoE is surprisingly effective for the price.
4. Agent architecture should be minimal. Pi proves 4 tools and a 1,000-token system prompt outperforms bloated frameworks. Context engineering matters more than tool count.
5. Open-weight models are genuinely useful. GLM-5 (MIT), MiniMax M2, Hermes 4, Qwen — strong performance across many tasks, openly available. They're great for simple workflows, privacy-sensitive tasks, and offline use. For complex reasoning and agentic coding, frontier cloud models still have a clear edge.
6. Local and cloud are complementary. cc-mirror and OpenCode let you use familiar interfaces with local or alternative models. The best setup for most developers is probably both — cloud for hard tasks, local for everything else.
This field evolves fast. I'm still early in my own local inference journey — learning what works, what's overhyped, and where the real value is. If you're curious, the easiest way to start is ollama run llama3 on your existing machine and see what it can do. No commitment, no cost.
Some links in this article are affiliate links. If you purchase through them, I may earn a small commission at no extra cost to you. I only recommend products I actually use.
