BaseRT delivers up to 1.56x faster decode throughput on Apple Silicon than llama.cpp, and that's just the start: against MLX, the margin sits at 1.35x. Those numbers come from real benchmarks on M3 and M4 Pro devices running Q4 and Q8 quantizations of Qwen3, Llama 3.2, and Gemma 4 at parameter counts from sub-1B to 30B. These are production-scale model families, not synthetic toy models.
Why Existing Runtimes Leave Performance on the Table
llama.cpp and MLX-based frameworks are built on abstractions designed for CPUs or GPUs with separate memory pools. Apple Silicon's unified memory topology — where CPU, GPU, and Neural Engine share the same physical memory — behaves differently. Those abstractions introduce dispatch overhead and miss opportunities for kernel fusion that Metal's execution model rewards. BaseRT ditches the adapter layers entirely. Every kernel is hand-rolled for Metal, dispatch logic is custom, and memory allocation is aware that the GPU can directly access the same pool without copying.
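To make the dispatch-overhead argument concrete, here is a minimal cost-model sketch in Python. It is not BaseRT's implementation, and every number in it is a hypothetical placeholder rather than a measurement. It assumes a layer's runtime is its compute time plus a fixed cost per kernel launch, so fusing kernels keeps the math the same while cutting the number of launches.

```python
from dataclasses import dataclass


@dataclass
class LayerCost:
    """Toy cost model for one transformer layer (all values hypothetical)."""
    compute_us: float   # time spent on the layer's actual math, in microseconds
    dispatches: int     # number of separate kernel launches
    overhead_us: float  # fixed cost paid per launch, in microseconds

    def total_us(self) -> float:
        # Total time = useful compute + per-launch overhead for every dispatch.
        return self.compute_us + self.dispatches * self.overhead_us


# Same compute in both cases; fusion only reduces the number of launches.
unfused = LayerCost(compute_us=100.0, dispatches=8, overhead_us=5.0)
fused = LayerCost(compute_us=100.0, dispatches=2, overhead_us=5.0)

speedup = unfused.total_us() / fused.total_us()
print(f"unfused: {unfused.total_us():.0f} us, fused: {fused.total_us():.0f} us")
print(f"speedup from fusion alone: {speedup:.2f}x")
```

In this toy model the benefit grows as per-launch overhead becomes a larger share of the layer's time, which is typically the case for small decode steps that process one token at a time.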
Measured Gains That Matter for Edge Inference
The headline decode numbers are solid, but the prefill advantage for mixture-of-experts models is even more striking. BaseRT's chip-specific kernel fusion directly targets the sparse activation patterns of MoE layers, recovering latency that general-purpose runtimes sacrifice. On M4 Pro, a 30B MoE model sees prefill speedups well beyond the decode ratio. That matters because prefill latency is the bottleneck for interactive applications like voice assistants and local code completion.
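A rough latency model shows why prefill dominates short interactive exchanges. The sketch below is illustrative only: the function names and the throughput figures are assumptions chosen for easy arithmetic, not results reported for BaseRT. It treats time to first token as prompt length divided by prefill throughput, and total response time as that plus output length divided by decode throughput.

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before any output token appears."""
    return prompt_tokens / prefill_tps


def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Total seconds for prompt processing plus token-by-token generation."""
    return (time_to_first_token_s(prompt_tokens, prefill_tps)
            + output_tokens / decode_tps)


# Hypothetical workload: long context, short reply (e.g. code completion).
PROMPT_TOKENS = 2000
OUTPUT_TOKENS = 50

# Hypothetical throughputs in tokens per second, not measured values.
baseline_total = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                                 prefill_tps=500.0, decode_tps=25.0)
baseline_ttft = time_to_first_token_s(PROMPT_TOKENS, prefill_tps=500.0)
prefill_share = baseline_ttft / baseline_total

# Doubling prefill throughput while leaving decode unchanged.
faster_total = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                               prefill_tps=1000.0, decode_tps=25.0)

print(f"baseline: {baseline_total:.1f} s total, "
      f"{prefill_share:.0%} of it spent in prefill")
print(f"with 2x prefill: {faster_total:.1f} s total")
```

With a long prompt and a short reply, most of the wait happens before the first token, so a prefill speedup shortens the response more than an equal decode speedup would.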
These results cast Apple M-series hardware in a different light. Privacy requirements, latency constraints, and rising cloud costs are pushing inference toward on-device deployment. BaseRT shows that a runtime purpose-built for the hardware, rather than ported from another ecosystem, can make local LLM inference genuinely competitive. The code is public on GitHub under basecompute/baseRT, so anyone with an M1 or later can try to reproduce these numbers today.
Source: BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal
Domain: arxiv.org