claude code · ollama
Claude CLI + Ollama for Free Local Dev
Pair the Claude Code CLI with a local Ollama server so you can keep the agentic workflow you like while pointing it at an open-weight model running on your own hardware — no API spend, no data leaving your laptop. Here's the short version, plus what we've found works best on Apple Silicon.
1. Install the Claude Code CLI
The official one-liner from code.claude.com. Works on macOS and Linux.
curl -fsSL https://claude.ai/install.sh | bash
Verify with claude --version. On Windows, install via WSL2 and follow the Linux path.
2. Install Ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
macOS
Grab the native app from ollama.com/download (it ships an .app bundle and a menu-bar daemon), or via Homebrew:
brew install --cask ollama
Once installed, Ollama listens on http://localhost:11434. Confirm with curl http://localhost:11434; you should see "Ollama is running".
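The check above can be scripted so that later steps fail early when the server is not reachable. This is a minimal sketch that assumes curl is installed; the function name ollama_is_up and the two-second timeout are illustrative choices, not part of Ollama.

```shell
# Hypothetical helper: succeeds only if an Ollama server answers at the given URL.
# The default URL is the address Ollama listens on after installation.
ollama_is_up() {
  url="${1:-http://localhost:11434}"
  # -f: treat HTTP errors as failures; -sS: quiet, but keep error messages;
  # --max-time 2: assumed timeout so the check never hangs.
  body="$(curl -fsS --max-time 2 "$url" 2>/dev/null)" || return 1
  # The root endpoint replies with the plain-text banner "Ollama is running".
  case "$body" in
    *"Ollama is running"*) return 0 ;;
    *) return 1 ;;
  esac
}

if ollama_is_up; then
  echo "Ollama is reachable"
else
  echo "Ollama is not reachable; start the app or run 'ollama serve'"
fi
```

Pass a different URL as the first argument if your server is not on the default port.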
3. Pull a coding model
Browse the catalog at ollama.com/library. For agentic CLI use, we currently lean on Qwen3-Coder when there's enough RAM, and fall back to Qwen2.5-Coder on smaller machines.
ollama pull qwen3-coder
The default tag is the 30B variant (~19GB on disk). Pick an explicit size with a tag, e.g. ollama pull qwen2.5-coder:7b. List what you have with ollama list.
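Since later steps depend on the exact tag you pulled, it can help to check for a tag before launching a session. This is a rough sketch that scans the first column of ollama list output. The name has_model is hypothetical, and mapping an untagged name to :latest is an assumption based on how ollama list usually displays the default tag.

```shell
# Hypothetical helper: reads `ollama list` output on stdin and succeeds
# if the requested model tag appears in the NAME column.
has_model() {
  want="$1"
  # Assumption: a name given without a tag is listed as "<name>:latest".
  case "$want" in
    *:*) ;;
    *) want="$want:latest" ;;
  esac
  # Skip the header row (NR > 1) and compare the first column exactly.
  awk -v want="$want" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Usage (requires Ollama): pull the model only if it is not already present.
# ollama list | has_model qwen2.5-coder:7b || ollama pull qwen2.5-coder:7b
```

Reading from stdin keeps the helper independent of a running server, so it can be tried against saved output.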
4. Point Claude Code at Ollama
Claude Code respects the ANTHROPIC_BASE_URL override, so an inline env-var invocation is the cleanest way to route a single session through Ollama without affecting the rest of your shell:
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude --model qwen3-coder
ANTHROPIC_AUTH_TOKEN just needs to be non-empty; Ollama ignores it. --model takes any tag you've pulled.
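If you switch between models, the same inline invocation can be wrapped in a small shell function that takes the tag as an argument. This is a sketch, not an official interface: claude_local is a made-up name, and the default of qwen3-coder simply mirrors the command above.

```shell
# Hypothetical wrapper: run one Claude Code session against local Ollama.
# The first argument is the model tag (defaults to qwen3-coder); any
# remaining arguments are passed through to claude unchanged.
claude_local() {
  model="${1:-qwen3-coder}"
  if [ "$#" -gt 0 ]; then shift; fi
  # The variables are prefixed to this one command, so the rest of the
  # shell session keeps its normal Anthropic configuration.
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_BASE_URL=http://localhost:11434 \
    claude --model "$model" "$@"
}
```

Call it as claude_local for the default model, or claude_local qwen2.5-coder:7b to pick a smaller one.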
Stay fully offline
Add CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to suppress telemetry/update pings:
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude --model qwen2.5-coder:7b
Make it a shell alias
If you do this often, drop this in ~/.zshrc or ~/.bashrc:
alias claude-local='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 claude --model qwen3-coder'
What works best on Apple Silicon
Apple Silicon's unified memory is a huge win for local models — RAM is GPU memory. Ollama uses Metal under the hood, and you don't need to tune anything. The constraint is just how much memory you have versus how much the model wants. Rough guidance from what we've tested:
| Machine | Sweet-spot model | Notes |
|---|---|---|
| M1/M2/M3, 8GB | qwen2.5-coder:3b or gemma3:4b | Workable for completion and small edits; agentic loops feel cramped. |
| M1/M2/M3, 16GB | qwen2.5-coder:7b | The realistic floor for Claude Code-style agentic work. Leave a few GB for your editor & browser. |
| M2/M3/M4 Pro, 24–32GB | qwen2.5-coder:14b | Noticeable jump in tool-use reliability versus the 7B. |
| M2/M3/M4 Max, 36–64GB | qwen3-coder:30b | Our default: a 30B MoE (about 3B active parameters) that's genuinely useful as an agent. Fits comfortably with headroom. |
| M2/M3 Ultra, 128GB+ | qwen3-coder:30b, larger MoEs | You have room for bigger weights (e.g. gpt-oss:120b), but 30B is usually the better latency/quality trade. |
Rule of thumb: a Q4-quantized model needs roughly 0.6 GB of RAM per billion parameters (so a 30B model wants about 18 GB for weights), plus 2–4 GB for the KV cache once you start filling the context. If ollama ps shows your model spilling to CPU, drop a size.
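The rule of thumb is easy to turn into a quick calculation. The sketch below only restates the arithmetic from the paragraph above; estimate_q4_ram is a hypothetical helper, and its argument is the model size in billions of parameters.

```shell
# Rough RAM estimate for a Q4-quantized model, following the rule of thumb:
# about 0.6 GB per billion parameters, plus 2-4 GB for the KV cache.
# estimate_q4_ram is a hypothetical helper name.
estimate_q4_ram() {
  awk -v b="$1" 'BEGIN {
    weights = b * 0.6
    printf "weights ~%.1f GB, total ~%.1f-%.1f GB\n", weights, weights + 2, weights + 4
  }'
}

estimate_q4_ram 7    # weights ~4.2 GB, total ~6.2-8.2 GB
estimate_q4_ram 30   # weights ~18.0 GB, total ~20.0-22.0 GB
```

Compare the upper figure against the memory you can actually spare after your editor and browser are open, not against the machine's total.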
Tips & gotchas
- Local models are much less reliable at multi-step tool use than hosted Claude. Use them for autocomplete-style help, code Q&A, and short edits — not for unattended agentic runs.
- Context windows in Ollama default to 4K–8K tokens. Raise the limit by setting OLLAMA_CONTEXT_LENGTH=32768 before starting the server, or set num_ctx in a custom Modelfile. Larger contexts eat RAM fast.
- First-token latency is dominated by the model loading into memory. Keep Ollama running between sessions (it's already a background service) so the weights stay warm; by default Ollama unloads an idle model after about five minutes, and OLLAMA_KEEP_ALIVE extends that.
- If claude --model <name> errors with "model not found", double-check ollama list. The tag must match exactly, including the size suffix.
- By default, Ollama listens on localhost only. If you want to share the server across machines on your network, set OLLAMA_HOST=0.0.0.0:11434, but think twice about doing this on coffee-shop Wi-Fi.
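For the context-window tip, a custom Modelfile is the option that sticks to one model rather than the whole server. Here is a sketch, assuming the base model is already pulled. The helper name ctx_modelfile, the derived model name qwen3-coder-32k, and the file name are made up for illustration; FROM and PARAMETER num_ctx are the standard Modelfile instructions.

```shell
# Hypothetical helper: print a minimal Modelfile that derives a
# larger-context variant of an existing model.
#   $1 = base model tag, $2 = context length in tokens
ctx_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# Show the generated Modelfile for a 32K-context qwen3-coder variant.
ctx_modelfile qwen3-coder 32768

# To build and use it (requires Ollama; names below are illustrative):
# ctx_modelfile qwen3-coder 32768 > Modelfile.qwen3-coder-32k
# ollama create qwen3-coder-32k -f Modelfile.qwen3-coder-32k
# claude --model qwen3-coder-32k   # with the ANTHROPIC_* variables set as in step 4
```

Check ollama ps after loading the variant; if the larger context pushes the model onto the CPU, reduce num_ctx or drop a model size.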
Bottom line: on a 32GB+ Apple Silicon machine, qwen3-coder via Ollama plus the Claude Code CLI gets you a usable, fully offline pair-programmer for free. It's not Sonnet, but it's a great fallback for flights, air-gapped work, or just keeping your token bill at zero.