
KV Cache Pooling Slashes Long-Context LLM Latency 4x

Unified KV pooling and a kernel-bypass trick cut time-to-first-token from 30.7s to under 10s across 8B-30B models.

Tags: unified kv pooling, kv cache, long context llm, spdk, llm serving, filesystem overhead

Time-to-first-token (TTFT) for long-context LLM serving clocks in at 30.7 seconds - more than 3x the 10-second threshold most interactive applications tolerate. That's not a pipeline issue; it's a storage architecture problem.

Why 30.7 Seconds is Unacceptable

Running LLaMA 3.1-8B, GPT-OSS-20B, or Qwen3-30B-A3B with long contexts forces KV cache offload to host memory and SSDs. Current caching mechanisms weren't designed for contexts that spill past GPU memory. Two root causes stand out. First, retrieval is serialized across host memory and SSD, so while one module is being read the others sit idle. Second, and worse, SSD-based KV retrieval spends 84% of its time burning cycles in the kernel filesystem - not actually reading or writing data.
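
To see why serialization hurts, here is a toy timing model. It is a minimal sketch, not the paper's measurement method: the tier names and per-tier fetch times are made-up illustrative values. Reading tiers one after another costs the sum of their fetch times, while reading them concurrently (assuming ideal overlap) costs only as much as the slowest tier.

```python
def serialized_time(fetch_times):
    """Total time when tiers are read one after another."""
    return sum(fetch_times.values())


def overlapped_time(fetch_times):
    """Total time when all tiers are read concurrently (ideal overlap)."""
    return max(fetch_times.values())


# Hypothetical per-tier fetch times in seconds (illustrative only,
# not figures from the paper).
fetch_times = {"host_memory": 2.0, "ssd_0": 6.0, "ssd_1": 6.0}

print(serialized_time(fetch_times))  # 14.0
print(overlapped_time(fetch_times))  # 6.0
```

In this made-up case, overlapping the three reads would more than halve retrieval time before any per-device optimization is applied.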

KV-Passthrough: Bypassing the Kernel Tax

The authors of the paper (arXiv 2606.14779) built a unified KV pooling layer that aggregates multiple host-memory modules and SSDs into a single logical pool. KV caches are placed onto whichever device offers the bandwidth they need. But the real trick is KV-passthrough: it bypasses the kernel filesystem entirely and talks directly to SSDs from user space via the Storage Performance Development Kit (SPDK). That single change slashes blocked I/O time by up to 23.2x.
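
The placement idea can be sketched as a simple bandwidth-aware picker. This is an illustration under stated assumptions, not the authors' algorithm: the `Device` class, the `place` function, the "slowest adequate device" policy, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    bandwidth_gb_s: float  # hypothetical sustained read bandwidth (GB/s)
    free_gb: float         # remaining capacity (GB)


def place(block_gb, min_bandwidth_gb_s, pool):
    """Pick the slowest device that still meets the block's bandwidth need.

    Choosing the slowest adequate device keeps faster tiers free for
    blocks that really need them. Returns the chosen device name, or
    None if no device in the pool has enough bandwidth and capacity.
    """
    candidates = [
        d for d in pool
        if d.bandwidth_gb_s >= min_bandwidth_gb_s and d.free_gb >= block_gb
    ]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda d: d.bandwidth_gb_s)
    chosen.free_gb -= block_gb  # reserve the space in the pool
    return chosen.name


# Hypothetical pool: one host-memory module and two SSDs.
pool = [
    Device("host_mem_0", bandwidth_gb_s=40.0, free_gb=64.0),
    Device("ssd_0", bandwidth_gb_s=7.0, free_gb=1000.0),
    Device("ssd_1", bandwidth_gb_s=7.0, free_gb=1000.0),
]

print(place(block_gb=2.0, min_bandwidth_gb_s=20.0, pool=pool))  # host_mem_0
print(place(block_gb=2.0, min_bandwidth_gb_s=5.0, pool=pool))   # ssd_0
```

The point of the single logical pool is that the caller asks for bandwidth and capacity and never names a specific device; a real policy would also have to account for contention and data movement between tiers.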

4.1x Faster and Under 10 Seconds

Across evaluations on LLaMA 3.1-8B, GPT-OSS-20B, and Qwen3-30B-A3B, unified KV pooling cuts TTFT by roughly 4.1x compared to state-of-the-art techniques. Every model configuration lands under the 10-second mark. No exotic hardware, no model surgery - just a smarter storage abstraction and a kernel bypass that should have been standard years ago.
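
As a quick sanity check on those figures, the arithmetic below assumes the roughly 4.1x speedup applies directly to the 30.7-second baseline; the article does not say which configuration each number comes from, so treat this as a back-of-the-envelope estimate.

```python
baseline_ttft_s = 30.7  # baseline TTFT quoted in the article
speedup = 4.1           # approximate speedup quoted in the article
threshold_s = 10.0      # interactive threshold quoted in the article

# Assumption: the speedup applies directly to the quoted baseline.
improved_ttft_s = baseline_ttft_s / speedup

print(f"{improved_ttft_s:.1f} s")     # 7.5 s
print(improved_ttft_s < threshold_s)  # True
```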

Long-context serving won't scale until the storage layer stops pretending SSDs are files. This work is a clean proof that the bottleneck is software, not silicon.


Source: Unified KV Pooling to Accelerate Long-Context LLM Serving
Domain: arxiv.org
