Local Gemma 4 Coding Agent Hits 72 Tokens/s on M1 Max With MTP

72.2 tokens per second out of a 26B parameter model on a three-year-old laptop chip. That’s the throughput Kyle Howells squeezed from a local coding agent on a 64 GB M1 Max Mac, running Gemma 4 26B-A4B quantized to Q4 with Multi-Token Prediction (MTP) speculation. No cloud, no GPU cluster, just llama.cpp with Metal acceleration and a 17 GB model folder.

The Trick: Q4 Main Model + Q8 MTP Draft Head

Howells started with a plain llama.cpp Metal baseline: 58.2 generation tok/s for a 128-token benchmark prompt. Adding the Gemma 4 MTP draft model—a Q8 quantized GGUF file from the same unsloth repo—bumped that to 72.2 tok/s with --spec-draft-n-max 3. That’s a 1.24x speedup, and prompt processing stayed nearly identical at ~296 tok/s. Sweeping draft token counts from 1 to 6 showed 3 was optimal on the M1 Max, with 2 trailing at 72.0 tok/s and anything over 4 dropping below baseline.

MLX fans, take note. Howells also tested mlx-lm with three different 4-bit variants: Unsloth UD, mlx-community, and mlx-community OptiQ. The fastest MLX run—Unsloth UD 4-bit—hit 45.8 tok/s, far behind both llama.cpp setups. “I thought MLX … would be fastest,” he writes. Instead, llama.cpp’s years of cross-platform optimization pulled ahead, even on Apple silicon.

Image Support Without the Slowdown

To feed the agent screenshots, Howells loaded the Gemma 4 multimodal projector (mmproj-BF16.gguf) via llama.cpp’s --mmproj flag. Re-running the text benchmark with the projector active showed no generation speed drop—still 72.2 tok/s. The only catch: the 12B variant is natively multimodal; the 26B needs the separate projector file. Pi (the terminal coding agent) now sees image tool output because llama.cpp advertises multimodal capabilities when the projector is loaded.

The full setup uses an OpenAI-compatible API, meaning tools like Pi can swap in this local agent transparently. Howells published the exact build commands, dependency installs, and model download steps—you’ll need brew install cmake git tmux [email protected], then a llama.cpp build with GGML_METAL=ON and GGML_ACCELERATE=ON. The model folder lands around 17 GB total. No internet required after download.

Next step: someone will push the MTP draft head into a proper speculative decoding pipeline for multi-agent systems. 72 tok/s on a laptop is already usable for real-time tool calls; a second draft model or a quantization-optimized draft could push that into territory where local coding agents feel indistinguishable from a remote API.

Source: How to Setup a Local Coding Agent on macOS
Domain: ikyle.me

Local Gemma 4 Coding Agent Hits 72 Tokens/s on M1 Max With MTP

The Trick: Q4 Main Model + Q8 MTP Draft Head

Image Support Without the Slowdown

More in Artificial Intelligence