BaseRT Delivers 1.56x Faster LLM Inference on Apple Silicon by Going Native on Metal

A new runtime built directly for the hardware and tuned to Apple's unified memory architecture outpaces llama.cpp by up to 56% and MLX by up to 35% in decode throughput, with further gains on Mixture-of-Experts models.

basert, apple silicon, metal, llm inference, edge inference, llamacpp

BaseRT delivers up to 1.56x faster decode throughput on Apple Silicon than llama.cpp; against MLX, the margin is 1.35x. Those numbers come from benchmarks on M3 and M4 Pro devices running Q4 and Q8 quantizations of Qwen3, Llama 3.2, and Gemma 4, at parameter counts from under 1B to 30B. These are production-scale model families, not synthetic toy models.
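To put those ratios in per-token terms: a throughput speedup of s shortens the time per decoded token by a fraction 1 - 1/s. The short Python sketch below works that out for the two figures quoted above. The function names are illustrative only and are not part of BaseRT.

```python
def percent_faster(speedup: float) -> float:
    """Throughput gain as a percentage: 1.56x means 56% more tokens per second."""
    return (speedup - 1.0) * 100.0


def latency_reduction(speedup: float) -> float:
    """Fraction of per-token time saved when throughput rises by `speedup`."""
    return 1.0 - 1.0 / speedup


# Ratios quoted in the article (both are "up to" upper bounds).
for baseline, speedup in [("llama.cpp", 1.56), ("MLX", 1.35)]:
    print(
        f"vs {baseline}: {percent_faster(speedup):.0f}% more tokens/s, "
        f"{latency_reduction(speedup) * 100:.1f}% less time per token"
    )
```

In other words, a 1.56x throughput gain trims roughly 36% off the time per token, and 1.35x trims roughly 26%.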

Why Existing Runtimes Leave Performance on the Table

llama.cpp and MLX-based frameworks are built on abstractions designed for CPUs or GPUs with separate memory pools. Apple Silicon's unified memory topology — where CPU, GPU, and Neural Engine share the same physical memory — behaves differently. Those abstractions introduce dispatch overhead and miss opportunities for kernel fusion that Metal's execution model rewards. BaseRT ditches the adapter layers entirely. Every kernel is hand-rolled for Metal, dispatch logic is custom, and memory allocation is aware that the GPU can directly access the same pool without copying.

Measured Gains That Matter for Edge Inference

The headline decode numbers are solid, but the prefill advantage for mixture-of-experts models is even more striking. BaseRT's chip-specific kernel fusion directly targets the sparse activation patterns of MoE layers, recovering latency that general-purpose runtimes sacrifice. On M4 Pro, a 30B MoE model sees prefill speedups well beyond the decode ratio. That matters because prefill latency is the bottleneck for interactive applications like voice assistants and local code completion.
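Why prefill dominates interactive latency can be seen with a simple model: the whole prompt has to be processed before the first output token appears. The sketch below assumes constant prefill and decode rates. Every number in it is a hypothetical placeholder chosen for illustration, not a measurement from the BaseRT benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    prefill_tps: float  # prompt tokens processed per second (hypothetical)
    decode_tps: float   # output tokens generated per second (hypothetical)


def time_to_first_token(prompt_tokens: int, rates: Rates) -> float:
    """Seconds before any output appears: the whole prompt must be prefilled."""
    return prompt_tokens / rates.prefill_tps


def total_latency(prompt_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Prefill time plus the time to decode the requested output tokens."""
    return time_to_first_token(prompt_tokens, rates) + output_tokens / rates.decode_tps


# Hypothetical workload: long prompt, short reply (e.g. local code completion).
baseline = Rates(prefill_tps=400.0, decode_tps=40.0)
faster_prefill = Rates(prefill_tps=800.0, decode_tps=40.0)

for name, rates in [("baseline", baseline), ("2x prefill", faster_prefill)]:
    ttft = time_to_first_token(4000, rates)
    total = total_latency(4000, 20, rates)
    print(f"{name}: first token after {ttft:.1f}s, done after {total:.1f}s")
```

With a long prompt and a short reply, nearly all of the wait is prefill, so under these assumptions a prefill speedup carries over almost one for one into total response time.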

These results put Apple M-series hardware in a different light. Privacy requirements, latency constraints, and rising cloud costs are pushing inference toward on-device deployment. BaseRT shows that a runtime purpose-built for the hardware, rather than ported from another ecosystem, can make local LLM inference genuinely competitive. The code is public on GitHub under basecompute/baseRT, so anyone with an M1 or later can run the benchmarks themselves.

Source: BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal
Domain: arxiv.org
