CPU LLM Inference Gets 11.5x Speedup by Keeping Weights Cache-Resident

A cache-resident execution model delivers a 2.04x to 11.51x speedup in time-per-output-token (TPOT) for Llama-3.2-3B and Llama-2-7B on multi-socket CPU clusters with 3D-stacked last-level caches. The result, from a recent arXiv paper (2606.25353), shows that commodity server CPUs can substantially outperform llama.cpp, a widely used inference engine, on equal hardware when execution is reorganized around cache residency.

Why Data Movement Dominates LLM Inference

Every decode step streams the model's full set of weights through the memory hierarchy, which means gigabytes of traffic per generated token for models of this size. GPUs address this with HBM and its very high bandwidth. CPUs backed by conventional off-chip DRAM have traditionally been unable to keep up. But 3D-stacked SRAM or DRAM caches now give server CPUs GB-scale last-level caches whose bandwidth within the socket can approach that of GPU memory. The trick is keeping the weights there.
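A back-of-the-envelope sketch shows why bandwidth sets the floor on decode latency: if every weight is read once per token, TPOT cannot drop below weight bytes divided by sustained bandwidth. The parameter count, 2-byte precision, bandwidth figures, and per-socket cache capacity below are illustrative assumptions, not numbers from the paper.

```python
import math


def weight_bytes(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Total weight footprint in bytes (2 bytes/param assumes 16-bit weights)."""
    return n_params * bytes_per_param


def tpot_lower_bound_ms(n_params: float, bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """Bandwidth-bound floor on time per output token, in milliseconds.

    Assumes each weight is read exactly once per token and ignores
    compute, KV-cache traffic, and synchronization.
    """
    seconds = weight_bytes(n_params, bytes_per_param) / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


def sockets_for_residency(n_params: float, bytes_per_param: float,
                          cache_gb_per_socket: float) -> int:
    """Minimum number of sockets whose combined cache can hold all weights."""
    total = weight_bytes(n_params, bytes_per_param)
    return math.ceil(total / (cache_gb_per_socket * 1e9))


if __name__ == "__main__":
    n_params = 7e9  # roughly a 7B-parameter model (assumption)
    # Hypothetical bandwidths: off-chip DRAM vs. a stacked last-level cache.
    for label, bw in [("DRAM-like, 100 GB/s", 100.0),
                      ("cache-like, 1000 GB/s", 1000.0)]:
        print(label, round(tpot_lower_bound_ms(n_params, 2.0, bw), 1), "ms/token")
    # Hypothetical 1 GB of last-level cache per socket.
    print("sockets needed:", sockets_for_residency(n_params, 2.0, 1.0))
```

Under these assumptions, a tenfold bandwidth gain lowers the latency floor tenfold, but only while the weights stay spread across enough sockets to remain cache-resident.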

Keeping weights resident calls for deeper pipelining, which raises the number of in-flight requests and, with them, the KV-cache footprint. And once the weights never leave the cache, synchronization at operator boundaries becomes a visible bottleneck. The paper's solution separates weight-centric operators from attention and KV-cache management into dedicated resource domains, keeping reusable weights cache-resident while scaling KV capacity independently of pipeline depth. It also relaxes synchronization from operator boundaries to true sub-operator dependencies, reducing coordination overhead.
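The benefit of relaxing synchronization can be illustrated with a toy two-stage model. This is not the paper's runtime: the function names and tile timings are hypothetical, and the model assumes a producer operator and a consumer operator that each process their tiles in order on separate resources, with consumer tile i depending only on producer tile i.

```python
from typing import Sequence


def barrier_makespan(producer: Sequence[float],
                     consumer: Sequence[float]) -> float:
    """Operator-boundary sync: the consumer starts only after every
    producer tile has finished."""
    return sum(producer) + sum(consumer)


def tile_dependency_makespan(producer: Sequence[float],
                             consumer: Sequence[float]) -> float:
    """Sub-operator sync: consumer tile i starts as soon as producer
    tile i is done and the consumer resource is free."""
    if len(producer) != len(consumer):
        raise ValueError("expected one consumer tile per producer tile")
    producer_done = 0.0
    consumer_done = 0.0
    for p, c in zip(producer, consumer):
        producer_done += p  # producer tiles run back to back
        consumer_done = max(producer_done, consumer_done) + c
    return consumer_done


if __name__ == "__main__":
    tiles = [1.0, 1.0, 1.0, 1.0]  # hypothetical per-tile times
    print("barrier:", barrier_makespan(tiles, tiles))
    print("tile-level:", tile_dependency_makespan(tiles, tiles))
```

With four equal tiles per stage, the barrier version takes 8 time units and the tile-level version takes 5, because the consumer overlaps with the producer instead of waiting at the operator boundary.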

Prototype Beats llama.cpp by a Wide Margin

The authors instantiated this model on a multi-socket CPU cluster using a weight-attention decoupled architecture with locality-aware placement and a specialized static runtime. On deployed Llama-3.2-3B and Llama-2-7B configurations, their prototype achieved a 2.04x to 11.51x speedup in time-per-output-token over equally provisioned llama.cpp. The hardware is not exotic: it is off-the-shelf CPUs with large last-level caches.

A validated analytical model extends the result: up to 13.9x TPOT speedup across model sizes, context lengths, and batch sizes. These numbers come from modeling cache bandwidth saturation and sub-operator dependency graphs, not hand-waving.
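To see why context length and batch size matter in any such estimate, note that KV-cache footprint grows linearly in both. The sketch below is not the paper's analytical model; it uses the standard per-token KV size formula, and the layer and head counts are taken from the publicly released model configurations as best understood here, so treat them as assumptions.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size in bytes: one key and one value vector per layer,
    per KV head, per token, per sequence (16-bit values by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


# Assumed configurations (not stated in the article).
LLAMA_2_7B = dict(n_layers=32, n_kv_heads=32, head_dim=128)
LLAMA_3_2_3B = dict(n_layers=28, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for name, cfg in [("Llama-2-7B", LLAMA_2_7B),
                      ("Llama-3.2-3B", LLAMA_3_2_3B)]:
        for batch in (1, 8):
            size = kv_cache_bytes(context_len=4096, batch_size=batch, **cfg)
            print(f"{name} ctx=4096 batch={batch}: {size / 2**30:.2f} GiB")
```

Under these assumptions, a single 4096-token sequence on Llama-2-7B already needs 2 GiB of KV state, which is why the design scales KV capacity separately from the cache that holds the weights.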

What This Means for LLM Inference on CPUs

This work suggests that the memory-hierarchy bottleneck in LLM inference is not fundamental, at least for small-to-medium models. When execution is designed to match the cache topology, by decoupling weight access from state management and synchronizing only on true dependencies, CPUs with GB-scale last-level caches become viable inference platforms. The same principles may extend to larger models with careful tiling, though the article's reported results cover only the 3B and 7B configurations.

Do not expect this to replace GPUs for training or massive batch serving. But for latency-sensitive, small-batch inference, where a GPU is often underutilized, cache-resident CPU execution may be the cheaper, lower-power alternative. The paper gives a concrete recipe: keep weights pinned, decouple attention, and synchronize at sub-operator granularity.

Source: Cache-Resident LLM Inference in GB-Scale Last-Level Caches
Domain: arxiv.org
