Source linked

Local Models Hit 75% of Frontier Accuracy for Agentic Coding

Vicki Boykis reports Gemma-4-12b-qat running locally on a Mac reaches ~75% of frontier model performance for agentic coding loops, making local models finally viable for real work.

vicki boykisgoogle gemma 4lm studiopi agentlocal modelsagentic coding

Agentic coding loops running entirely on a 2022 M2 Mac with 64GB RAM now hit about 75% the accuracy and speed of frontier API models. Vicki Boykis, who has run local models since they first appeared, finally calls them "surprisingly good now" after testing Gemma-4-12b-qat through LM Studio and the Pi agent harness.

Gemma 4 Changed the Calculus

Boykis worked through Mistral 7B, Gemma 3, OpenAI OSS-20B, Qwen 3 MOE, and several Qwen 2.5 Coder variants across llama.cpp, Ollama, llamafiles, and Open WebUI. None crossed her personal threshold: "do I have to double-check it against an API model?" GPT-OSS came close, but Gemma-4-26b-a4b and especially the newer Gemma-4-12b-qat made the leap. Her actual tasks: refactoring a Python notebook into 5-6 modules with correct type hints, writing unit tests, bootstrapping a two-tower recommendation repo from scratch. These used to be impossible locally six months ago.

The architecture of Gemma-4-12b-qat itself raises interesting questions about performance-constrained tradeoffs - a line of inquiry the token gold rush largely ignored.

How to Run Agentic Models Locally Today

You need three pieces: a local inference engine (Boykis uses LM Studio), an agentic harness (Pi), and the model artifact. The setup is straightforward: point Pi's models.json at the local endpoint running the downloaded model. Boykis runs everything inside Docker for security, giving Pi only bash access and no Python execution or web browsing. Her Docker Compose config mounts a custom models.json that sets lmstudio as the provider with google/gemma-4-12b-qat and routes through http://host.docker.internal:1234/v1. The Pi session runs in a container with limited permissions and mounts the workspace as a volume.

Her launch script is a simple bash wrapper that sets WORKSPACE and calls docker compose up. The KV cache still grows to 64GB RAM during extended sessions, but the speed and accuracy make that tradeoff acceptable.

Next, expect more developers to copy this pattern: a local 12B QAT model paired with a well-configured agent harness and Docker sandbox. The boundary between local and API just shifted decisively.


Source: Running local models is good now
Domain: vickiboykis.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.