
Claude CLI + Ollama for Free Local Dev

Pair the Claude Code CLI with a local Ollama server so you can keep the agentic workflow you like while pointing it at an open-weight model running on your own hardware — no API spend, no data leaving your laptop. Here's the short version, plus what we've found works best on Apple Silicon.

1. Install the Claude Code CLI

The official one-liner from code.claude.com. Works on macOS and Linux.

curl -fsSL https://claude.ai/install.sh | bash

Verify with claude --version. On Windows, install via WSL2 and follow the Linux path.

2. Install Ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

macOS

Grab the native app from ollama.com/download (it ships an .app bundle and a menu-bar daemon), or via Homebrew:

brew install --cask ollama

Once installed, Ollama listens on http://localhost:11434. Confirm with curl http://localhost:11434 — you should see Ollama is running.

3. Pull a coding model

Browse the catalog at ollama.com/library. For agentic CLI use, we currently lean on Qwen3-Coder when there's enough RAM, and fall back to Qwen2.5-Coder on smaller machines.

ollama pull qwen3-coder

The default tag is the 30B variant (~19GB on disk). Pick an explicit size with a tag, e.g. ollama pull qwen2.5-coder:7b. List what you have with ollama list.
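
Since tags have to match what ollama list reports, a small check before launching can save a confusing error later. The following is a minimal sketch, not part of Ollama: has_model is a hypothetical helper name, and it assumes the model tag is the first column of the ollama list output. A name with no tag is treated as :latest, which is how Ollama resolves untagged names.

```shell
# Hypothetical helper: reads `ollama list` output on stdin and succeeds only if
# the tag given as $1 appears in the first (NAME) column.
# The body runs in a subshell, so the `tag` variable does not leak into your shell.
has_model() (
  case "$1" in
    *:*) tag="$1" ;;          # explicit tag, e.g. qwen2.5-coder:7b
    *)   tag="$1:latest" ;;   # untagged names resolve to :latest
  esac
  awk -v tag="$tag" '$1 == tag { found = 1 } END { exit found ? 0 : 1 }'
)

# Usage (pull only when the tag is missing):
#   ollama list | has_model qwen2.5-coder:7b || ollama pull qwen2.5-coder:7b
```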

4. Point Claude Code at Ollama

Claude Code respects the ANTHROPIC_BASE_URL override, so an inline env-var invocation is the cleanest way to route a single session through Ollama without affecting the rest of your shell:

ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude --model qwen3-coder

ANTHROPIC_AUTH_TOKEN just needs to be non-empty; Ollama ignores it. --model takes any tag you've pulled.

Stay fully offline

Add CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to suppress telemetry/update pings:

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude --model qwen2.5-coder:7b

Make it a shell alias

If you do this often, drop this in ~/.zshrc or ~/.bashrc:

alias claude-local='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 claude --model qwen3-coder'
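
If you switch models often, a shell function is a little more flexible than an alias because it can take the model as an argument. This is a sketch under a few assumptions: the name claude_local and the qwen3-coder default are our own choices, and the environment variables are the same ones used in the alias above. The body runs in a subshell, so the exports never reach your main shell.

```shell
# Sketch: function variant of the alias. First argument is the model tag
# (default qwen3-coder, an illustrative choice); the rest go straight to claude.
claude_local() (
  model="${1:-qwen3-coder}"
  if [ "$#" -gt 0 ]; then shift; fi

  # Exported inside the subshell only, so the parent shell is untouched.
  export ANTHROPIC_AUTH_TOKEN=ollama
  export ANTHROPIC_BASE_URL=http://localhost:11434
  export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

  claude --model "$model" "$@"
)
```

With this in ~/.zshrc or ~/.bashrc, claude_local starts a session on the default model and claude_local qwen2.5-coder:7b switches to the smaller one.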

What works best on Apple Silicon

Apple Silicon's unified memory is a huge win for local models — RAM is GPU memory. Ollama uses Metal under the hood, and you don't need to tune anything. The constraint is just how much memory you have versus how much the model wants. Rough guidance from what we've tested:

Machine	Sweet-spot model	Notes
M1/M2/M3, 8GB	`qwen2.5-coder:3b` or `gemma3:4b`	Workable for completion and small edits; agentic loops feel cramped.
M1/M2/M3, 16GB	`qwen2.5-coder:7b`	The realistic floor for Claude Code-style agentic work. Leave a few GB for your editor & browser.
M2/M3/M4 Pro, 24–32GB	`qwen2.5-coder:14b`	Noticeable jump in tool-use reliability versus the 7B.
M2/M3/M4 Max, 36–64GB	`qwen3-coder:30b`	Our default: a 30B MoE that's genuinely useful as an agent. Fits comfortably with headroom.
M2/M3 Ultra, 128GB+	`qwen3-coder:30b`, larger MoEs	You have room for bigger weights (e.g. `gpt-oss:120b`), but 30B is usually the better latency/quality trade.
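
The table can be turned into a quick lookup. This is a rough sketch only: suggest_model is a hypothetical helper, and the cut-offs for memory sizes that fall between table rows (say, 18 or 20 GB) are our assumption, rounded down to the smaller model.

```shell
# Hypothetical helper: maps installed RAM in whole GB ($1) to the model tags
# from the table. Boundaries between rows are assumptions, not measurements.
suggest_model() {
  if   [ "$1" -ge 36 ]; then echo "qwen3-coder:30b"
  elif [ "$1" -ge 24 ]; then echo "qwen2.5-coder:14b"
  elif [ "$1" -ge 16 ]; then echo "qwen2.5-coder:7b"
  else                       echo "qwen2.5-coder:3b"
  fi
}

# Usage on macOS, where hw.memsize reports physical memory in bytes:
#   suggest_model "$(( $(sysctl -n hw.memsize) / 1073741824 ))"
```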

Rule of thumb: a Q4-quantized model needs roughly 0.6 GB of RAM per billion parameters (so about 18 GB for a 30B model), plus 2–4 GB for the KV cache once you start filling the context. If ollama ps shows your model spilling to CPU, drop a size.
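
To make the rule of thumb concrete, here is the arithmetic as a tiny shell function. It is a back-of-the-envelope sketch: estimate_ram_gb is a hypothetical name, the 0.6 factor and the 2–4 GB cache range come from the rule above, and the default of 3 GB is simply the middle of that range.

```shell
# Rule-of-thumb estimate, not a measurement.
# $1 = parameter count in billions
# $2 = KV-cache allowance in GB (assumed default: 3, mid-range of 2-4)
estimate_ram_gb() {
  awk -v p="$1" -v kv="${2:-3}" 'BEGIN { printf "%.1f\n", p * 0.6 + kv }'
}
```

For example, estimate_ram_gb 30 4 prints 22.0, which is why a 30B model wants more than a 24GB machine can comfortably give it, while estimate_ram_gb 7 prints 7.2.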

Tips & gotchas

Local models are much less reliable at multi-step tool use than hosted Claude. Use them for autocomplete-style help, code Q&A, and short edits — not for unattended agentic runs.
Ollama's default context window is small (4K–8K tokens). Raise it with OLLAMA_CONTEXT_LENGTH=32768 before starting the server, or set num_ctx in a custom Modelfile. Larger contexts eat RAM fast.
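First-token latency is dominated by loading the model into memory. Ollama already runs as a background service, but by default it unloads an idle model after about five minutes; set OLLAMA_KEEP_ALIVE (for example, OLLAMA_KEEP_ALIVE=1h) before starting the server so the weights stay warm between sessions.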
If claude --model <name> errors with "model not found", double-check ollama list — the tag must match exactly, including the size suffix.
By default, Ollama listens on localhost only. To share the server across machines on your network, set OLLAMA_HOST=0.0.0.0:11434, but think twice about doing this on coffee-shop Wi-Fi, since the API has no authentication of its own.
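
For the num_ctx route mentioned in the context-window tip, the Modelfile only needs two lines. The helper below is a sketch: write_ctx_modelfile is a hypothetical name, and the 32768 value in the usage comment just mirrors the example above. It only writes the file; building the model is a separate step.

```shell
# Hypothetical helper: writes a two-line Modelfile that derives a
# larger-context variant from a model you have already pulled.
# $1 = base model tag, $2 = context length in tokens,
# $3 = output path (defaults to ./Modelfile)
write_ctx_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "${3:-Modelfile}"
}

# Usage (qwen3-coder-32k is an illustrative name for the new model):
#   write_ctx_modelfile qwen3-coder 32768
#   ollama create qwen3-coder-32k -f Modelfile
```

After that, pass the new name to --model in place of the base tag.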

Bottom line: on a 32GB+ Apple Silicon machine, qwen3-coder via Ollama plus the Claude Code CLI gets you a usable, fully-offline pair-programmer for free. It's not Sonnet, but it's a great fallback for flights, airgapped work, or just keeping your token bill at zero.
