
Air-Gapped Qwen3.6 on an M3 Pro Takes a Kubernetes Incident Through to a PR

har-ki.github.io · @systems_wire · 2 days ago · Systems Engineering · 10 comments

After four fixes, a local Qwen3.6 model on an M3 Pro took a Kubernetes incident from investigation to a pull request, all offline

claude code, qwen36, ollama, apple silicon, air gapped, incident response

A 35.1B-parameter mixture-of-experts model running entirely on a single laptop took a Kubernetes incident from investigation to an open pull request (root cause found, patch written, branch pushed, PR filed) with no data leaving the machine. That's not a frontier model on a cloud GPU. It's qwen3.6:35b-a3b-coding-nvfp4 on an Apple M3 Pro with 36 GiB unified memory.

The Rig That Made It Work

Hardware: M3 Pro, 18 GPU cores, 36 GiB unified memory, ~150 GB/s memory bandwidth.
Model: qwen3.6:35b-a3b-coding-nvfp4, a 35.1B-parameter mixture-of-experts with ~3B active per token and NVFP4 quantization; 21 GB on disk, ~20 GiB resident once loaded.
Runtime: Ollama 0.24.0 with the MLX runner (the Apple Silicon-native path, not the llama.cpp/Metal backend).
Client: Claude Code v2.1.84 pointed at the local Ollama endpoint.

Key environment: OLLAMA_CONTEXT_LENGTH=32768 — that 32K window is what 36 GiB buys you. OLLAMA_MLX=1, OLLAMA_FLASH_ATTENTION=1, OLLAMA_MULTIUSER_CACHE=1, OLLAMA_KEEP_ALIVE=24h. No ANTHROPIC_API_KEY set — that's what forces Claude Code to hit localhost instead of Anthropic's cloud.
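As a rough sketch of how those settings could be applied from a script, the snippet below builds one environment for the server and one for the client. The variable names and values are copied from the list above; the helper names `build_server_env` and `build_client_env` are hypothetical, and the article does not say how the client is told where the local endpoint lives, so that part is left out.

```python
import os

# Settings quoted in the article, copied verbatim.
OLLAMA_ENV = {
    "OLLAMA_CONTEXT_LENGTH": "32768",
    "OLLAMA_MLX": "1",
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_MULTIUSER_CACHE": "1",
    "OLLAMA_KEEP_ALIVE": "24h",
}


def build_server_env(base=None):
    """Copy `base` (default: the current environment) and apply the Ollama settings."""
    env = dict(os.environ if base is None else base)
    env.update(OLLAMA_ENV)
    return env


def build_client_env(base=None):
    """Copy `base` and drop ANTHROPIC_API_KEY, as the article describes for the client."""
    env = dict(os.environ if base is None else base)
    env.pop("ANTHROPIC_API_KEY", None)
    return env
```

The server environment can then be passed to something like `subprocess.Popen(["ollama", "serve"], env=build_server_env())`.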

The Four Fixes That Turned a Timeout Into a Workflow

First attempt: ten minutes of thinking, then timeout — no tool calls, zero output. The model spent its entire session reasoning because Claude Code's thinking mode was on and unbounded on a model that outputs ~5–8 tokens/sec.
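A quick back-of-the-envelope calculation shows why the timeout was so easy to hit. It uses only the figures quoted above (a ten-minute session and 5 to 8 tokens per second); the function name is illustrative.

```python
def tokens_generated(seconds, tokens_per_sec):
    """Tokens a model can emit in `seconds` at a steady decode rate."""
    return seconds * tokens_per_sec


TIMEOUT_SECONDS = 10 * 60  # the ten-minute session from the first attempt

low = tokens_generated(TIMEOUT_SECONDS, 5)   # slow end of the quoted range
high = tokens_generated(TIMEOUT_SECONDS, 8)  # fast end of the quoted range
print(f"Token budget before timeout: {low} to {high}")
```

At that rate the whole session amounts to a few thousand tokens, which an unbounded reasoning chain can consume on its own before a single tool call is emitted.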

Fix one: MAX_THINKING_TOKENS=0. Disabling reasoning forced the model to emit tool calls immediately instead of chasing an infinite chain of thought.
Fix two: Ollama 0.24.0 or newer. Older versions lack the MLX runner.
Fix three: run ollama serve with those tuned env vars as a launchd service so the 20 GiB model stays resident.
Fix four: accept that the first turn (prefill of ~25,000 tokens) takes about 60 seconds, and that the initial burst of 404s in the Ollama log is harmless.
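Fix three can be sketched with Python's standard `plistlib`, which serializes a launchd job definition. This is a minimal illustration, not the author's actual service file: the label `com.example.ollama-serve` and the binary path passed in are placeholders, and only standard launchd keys are used.

```python
import plistlib


def build_launchd_plist(ollama_path, env, label="com.example.ollama-serve"):
    """Return XML plist bytes for a launchd job that runs `ollama serve`.

    `label` and `ollama_path` are placeholders; adjust them to the local install.
    """
    job = {
        "Label": label,
        "ProgramArguments": [ollama_path, "serve"],
        "EnvironmentVariables": dict(env),  # launchd expects string values
        "RunAtLoad": True,   # start when the job is loaded
        "KeepAlive": True,   # restart the server if it exits
    }
    return plistlib.dumps(job)


plist_bytes = build_launchd_plist(
    "/usr/local/bin/ollama",  # assumed install path
    {"OLLAMA_CONTEXT_LENGTH": "32768", "OLLAMA_KEEP_ALIVE": "24h"},
)
```

Writing the bytes to a `.plist` file under `~/Library/LaunchAgents` and loading it with `launchctl` would keep the server, and with it the resident model, alive across sessions.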

After those four changes, the same laptop ran kubectl get pods -A, found the unhealthy pod, traced the root cause, wrote the patch, pushed a branch, and filed a PR with gh. Total session: 34 minutes. The loop closed.
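The first step of that loop, spotting the unhealthy pod, can be illustrated with a small filter over the JSON that `kubectl get pods -A -o json` returns. This is a sketch of the idea rather than what the model actually ran; the function name and the health criteria (phase plus container readiness) are assumptions.

```python
def unhealthy_pods(pod_list):
    """Return (namespace, name, reason) for pods that look unhealthy.

    `pod_list` is the parsed output of `kubectl get pods -A -o json`.
    A pod is flagged if its phase is not Running/Succeeded, or if any
    container reports ready=False.
    """
    results = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # completed jobs are not incidents
        reason = None if phase == "Running" else phase
        for cs in status.get("containerStatuses", []):
            if not cs.get("ready", False):
                waiting = cs.get("state", {}).get("waiting", {})
                # Prefer the container's own reason, e.g. CrashLoopBackOff.
                reason = waiting.get("reason", "NotReady")
                break
        if reason is not None:
            results.append((meta.get("namespace"), meta.get("name"), reason))
    return results
```

In practice the input would come from running the kubectl command with `subprocess.run(..., capture_output=True, text=True)` and passing its stdout through `json.loads`.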

What Your Hardware Decides

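MoE architecture is what makes this practical: only ~3B parameters are active per token, so runtime cost resembles a 14B dense model while answer quality approaches 35B. NVFP4 quantization is what makes it fit: the 35.1B parameters take 21 GB on disk, whereas the same parameter count at 16-bit precision would need roughly 70 GB, well beyond 36 GiB. The 32K context window is bounded by memory capacity, while prefill and generation speed depend on compute and memory bandwidth. More unified memory (48 GiB, 64 GiB) would raise that window and, on chips that pair it with more bandwidth and GPU cores, likely shrink the 60-second prefill bottleneck.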
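The numbers quoted in this article allow a rough budget to be worked out. The sketch below derives the memory left after the model is resident and the implied prefill throughput, then extrapolates to a full 32K window. The extrapolation assumes prefill time scales linearly with prompt length, which is a simplification, since attention cost grows with context.

```python
TOTAL_GIB = 36           # unified memory on the M3 Pro
RESIDENT_GIB = 20        # approximate resident size of the loaded model
PREFILL_TOKENS = 25_000  # first-turn prompt size quoted in the article
PREFILL_SECONDS = 60     # observed first-turn latency
CONTEXT_TOKENS = 32_768  # OLLAMA_CONTEXT_LENGTH


def headroom_gib(total, resident):
    """Memory left for KV cache, the OS, and other processes."""
    return total - resident


def prefill_rate(tokens, seconds):
    """Implied prefill throughput in tokens per second."""
    return tokens / seconds


def prefill_seconds(tokens, rate):
    """Estimated prefill time, assuming linear scaling (a simplification)."""
    return tokens / rate


rate = prefill_rate(PREFILL_TOKENS, PREFILL_SECONDS)
print(f"Headroom: {headroom_gib(TOTAL_GIB, RESIDENT_GIB)} GiB")
print(f"Prefill rate: {rate:.0f} tokens/s")
print(f"Full-window prefill: {prefill_seconds(CONTEXT_TOKENS, rate):.0f} s")
```

The headroom figure has to cover everything else running on the laptop, which is why the context length is the first knob to give when memory gets tight.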

The takeaway for regulated environments: capable AI-driven incident response is possible without crossing a firewall. The only debate is how fast your hardware can prefill.


Source: Running Claude Code Offline on an M3 Pro with Qwen3.6
Domain: har-ki.github.io
