CXL-Based KV Cache Cuts TTFT by 9.7x for Sparse Attention LLMs

SAC fetches only the active top-k KV entries via CXL at cache-line granularity, delivering 2.1x higher throughput and 9.7x lower time-to-first-token over RDMA baselines on DeepSeek-V3.2.

sac, cxl, deepseek v32, sglang, kv cache, sparse attention

9.7x lower time-to-first-token on DeepSeek-V3.2 by swapping RDMA for CXL - that's the number that jumped off the arXiv preprint for SAC.

The Memory Wall Hits Long-Context Inference

LLM serving for long contexts has flipped the bottleneck: it's no longer FLOPs, it's memory capacity. Traditional disaggregated KV cache systems lean on RDMA to yank the entire prefix cache from a remote pool into local GPU memory before decoding starts. That works for dense attention, where every decoding step reads every cached token. But sparse attention models - think DeepSeek-V3.2 - only activate a small fraction of KV entries per decoding step. Pulling the whole cache anyway wastes bandwidth, bloats local memory, and inflates latency.
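To put rough numbers on that mismatch, here is a back-of-envelope sketch in Python. The prefix length, the value of k, and the bytes per KV entry are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def full_prefix_bytes(prefix_tokens: int, bytes_per_entry: int) -> int:
    """Bytes moved up front when the whole prefix cache is copied to the GPU."""
    return prefix_tokens * bytes_per_entry


def top_k_bytes_per_step(k: int, bytes_per_entry: int) -> int:
    """Bytes touched in one decoding step when only k entries are fetched."""
    return k * bytes_per_entry


def active_fraction(prefix_tokens: int, k: int) -> float:
    """Share of the cached prefix that a single decoding step actually reads."""
    return min(k, prefix_tokens) / prefix_tokens


# Illustrative values only (assumptions, not measurements).
PREFIX_TOKENS = 128_000
TOP_K = 2_048
BYTES_PER_ENTRY = 1_024

print(full_prefix_bytes(PREFIX_TOKENS, BYTES_PER_ENTRY))
print(top_k_bytes_per_step(TOP_K, BYTES_PER_ENTRY))
print(active_fraction(PREFIX_TOKENS, TOP_K))
```

Under these assumptions a single step reads under 2% of the prefix. The per-step fetch repeats on every decoding step, so this compares the cost paid before the first token, not total traffic over a whole generation.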

SAC's CXL-Granularity On-Demand Fetch

The authors built SAC, which they describe as the first disaggregated KV cache system purpose-built for sparse attention. Instead of coarse-grained RDMA transfers, SAC leverages Compute Express Link (CXL) - specifically its load/store semantics at cache-line granularity. CXL lets SAC fetch only the top-k KV entries on demand, right when the sparse attention mechanism needs them. No prefetching the full prefix. No local memory bloat.
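The on-demand idea can be sketched as a toy Python model. This is not SAC's implementation: `RemoteKVPool`, `fetch_top_k`, the relevance scores, and the 64-byte cache-line size are all assumptions made for illustration, and a plain Python list stands in for CXL-attached memory.

```python
import heapq

CACHE_LINE_BYTES = 64  # assumed cache-line size for this toy model


class RemoteKVPool:
    """Toy stand-in for a remote KV pool that counts the bytes it serves."""

    def __init__(self, entries, entry_bytes):
        self.entries = entries
        self.entry_bytes = entry_bytes
        self.bytes_read = 0

    def load(self, idx):
        # A load touches whole cache lines, so round the entry size up.
        lines = -(-self.entry_bytes // CACHE_LINE_BYTES)
        self.bytes_read += lines * CACHE_LINE_BYTES
        return self.entries[idx]

    def load_all(self):
        # Dense-style fetch: pull every entry in the prefix.
        return [self.load(i) for i in range(len(self.entries))]


def fetch_top_k(pool, scores, k):
    """Load only the k highest-scoring entries, keyed by token position."""
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return {i: pool.load(i) for i in sorted(top)}


# Hypothetical per-token relevance scores for an 8-token prefix.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.0]
entries = [f"kv{i}" for i in range(len(scores))]

sparse_pool = RemoteKVPool(entries, entry_bytes=100)
selected = fetch_top_k(sparse_pool, scores, k=3)

dense_pool = RemoteKVPool(entries, entry_bytes=100)
dense_pool.load_all()

print(sorted(selected), sparse_pool.bytes_read, dense_pool.bytes_read)
```

The byte counter only makes the contrast visible: the sparse path pays for k entries, while the dense path pays for the whole prefix regardless of how many entries the step uses.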

Real Numbers on Real Hardware

Evaluations on DeepSeek-V3.2 using the SGLang serving framework tell the story: SAC achieves 2.1x higher throughput, 9.7x lower time-to-first-token (TTFT), and 1.8x lower time-between-tokens (TBT) compared to RDMA-based disaggregation baselines. The CXL approach doesn't just shave latency - it eliminates the fundamental mismatch between dense-fetch semantics and sparse computation.

This shifts the default infrastructure choice for any inference stack running sparse attention models. CXL disaggregation isn't a theoretical curiosity anymore; it's the production path for long-context serving at scale.


Source: SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL
Domain: arxiv.org
