
KV Cache Pooling Cuts Long-Context LLM Time-to-First-Token 4x

Unified KV pooling and a kernel bypass cut time-to-first-token from 30.7 seconds to under 10 seconds on 8B-30B models.

unified kv pooling, kv cache, long context llm, spdk, llm serving, filesystem overhead

Time-to-first-token (TTFT) for long-context LLM serving clocks in at 30.7 seconds - more than 3x the 10-second threshold most interactive applications tolerate. That's not a pipeline issue; it's a storage architecture problem.

Why 30.7 Seconds Is Unacceptable

Running LLaMA 3.1-8B, GPT-OSS-20B, or Qwen3-30B-A3B with long contexts forces KV cache offload to host memory and SSDs. Current caching mechanisms weren't designed for contexts that spill past GPU memory. Two root causes stand out. First, retrieval is serialized across host memory and SSD, so while one module is being read the others sit idle. Second, and worse, SSD-based KV retrieval spends 84% of its time burning cycles in the kernel filesystem - not actually reading or writing data.
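Both root causes can be put into a back-of-the-envelope model. The sketch below is illustrative only: the `Tier` class, the byte counts, and the bandwidth figures are hypothetical placeholders, not numbers from the paper. Only the 84% overhead share comes from the article.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str             # hypothetical tier label
    bytes_to_read: float  # KV bytes held on this tier
    bandwidth: float      # sustained read bandwidth, bytes per second


def serialized_time(tiers):
    """Tiers are read one after another, so the per-tier times add up."""
    return sum(t.bytes_to_read / t.bandwidth for t in tiers)


def overlapped_time(tiers):
    """Tiers are read concurrently, so the slowest tier sets the total."""
    return max(t.bytes_to_read / t.bandwidth for t in tiers)


def max_speedup_if_overhead_removed(overhead_fraction):
    """Amdahl-style bound: only the non-overhead share of the time remains."""
    return 1.0 / (1.0 - overhead_fraction)


GB = 1e9
# Placeholder values chosen for illustration, not measured.
tiers = [
    Tier("host-memory", 8 * GB, 20 * GB),
    Tier("ssd", 12 * GB, 6 * GB),
]

t_serial = serialized_time(tiers)
t_overlap = overlapped_time(tiers)
bound = max_speedup_if_overhead_removed(0.84)  # 84% share quoted in the article

print(f"serialized: {t_serial:.2f} s, overlapped: {t_overlap:.2f} s")
print(f"upper bound from removing an 84% overhead share: {bound:.2f}x")
```

The bound applies to total SSD retrieval time under the assumption that all filesystem overhead disappears. It is a different quantity from blocked I/O time, so the two should not be compared directly.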

KV-Passthrough: Bypassing the Kernel Tax

The authors of the paper (arXiv 2606.14779) built a unified KV pooling layer that aggregates multiple host-memory modules and SSDs into a single logical pool. KV caches are placed onto whichever device offers the bandwidth they need. But the real trick is KV-passthrough: it bypasses the kernel filesystem entirely and talks directly to SSDs from user space via the Storage Performance Development Kit (SPDK). That single change slashes blocked I/O time by up to 23.2x.
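To make the pooling idea concrete, here is a minimal sketch of bandwidth-aware placement over a single logical pool. It is not the authors' algorithm: `Device`, `KVPool`, and the earliest-finish greedy heuristic are assumptions made for illustration, and the bandwidth and capacity values are placeholders. The SPDK passthrough path is not modeled, since it depends on SPDK's C API and on NVMe hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    bandwidth: float        # read bandwidth in GB/s (placeholder value)
    capacity: int           # capacity in GB
    used: int = 0           # GB currently occupied
    queued: int = 0         # GB assigned to this device so far

    def finish_time_with(self, size):
        """Estimated time to read everything queued plus `size` more GB."""
        return (self.queued + size) / self.bandwidth


@dataclass
class KVPool:
    """Hypothetical single logical pool over host-memory modules and SSDs."""
    devices: list
    placement: dict = field(default_factory=dict)

    def place(self, block_id, size):
        """Greedy heuristic: put the block on the device that finishes earliest."""
        candidates = [d for d in self.devices if d.used + size <= d.capacity]
        if not candidates:
            raise MemoryError(f"no device can hold block {block_id}")
        best = min(candidates, key=lambda d: d.finish_time_with(size))
        best.used += size
        best.queued += size
        self.placement[block_id] = best.name
        return best.name


pool = KVPool([
    Device("dram0", bandwidth=20.0, capacity=64),
    Device("ssd0", bandwidth=6.0, capacity=512),
    Device("ssd1", bandwidth=6.0, capacity=512),
])

# Place ten 1 GB KV blocks; faster devices end up with a larger share.
for i in range(10):
    pool.place(f"b{i}", 1)

counts = {d.name: d.queued for d in pool.devices}
print(counts)
```

Spreading blocks in proportion to bandwidth keeps every device busy for roughly the same time, which is the opposite of the serialized retrieval described earlier.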

4.1x Faster and Under 10 Seconds

Across evaluations on LLaMA 3.1-8B, GPT-OSS-20B, and Qwen3-30B-A3B, unified KV pooling cuts TTFT by roughly 4.1x compared to state-of-the-art techniques. Every model configuration lands under the 10-second mark. No exotic hardware, no model surgery - just a smarter storage abstraction and a kernel bypass that should have been standard years ago.
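The headline numbers can be sanity-checked with simple arithmetic. The sketch assumes the roughly 4.1x reduction applies directly to the 30.7-second baseline, which the article implies but does not state per model.

```python
BASELINE_TTFT_S = 30.7      # baseline TTFT quoted in the article
INTERACTIVE_LIMIT_S = 10.0  # interactivity threshold quoted in the article
SPEEDUP = 4.1               # approximate TTFT reduction quoted in the article

# How far the baseline overshoots the threshold; this is also the
# minimum speedup needed to reach the threshold.
overshoot = BASELINE_TTFT_S / INTERACTIVE_LIMIT_S

# TTFT after applying the quoted speedup (assumption: it applies to this baseline).
improved_ttft = BASELINE_TTFT_S / SPEEDUP

print(f"baseline overshoots the limit by {overshoot:.2f}x")
print(f"TTFT after a {SPEEDUP}x speedup: {improved_ttft:.1f} s")
print(f"under the limit: {improved_ttft < INTERACTIVE_LIMIT_S}")
```

Any speedup above about 3.07x clears the 10-second mark from this baseline, so 4.1x leaves some headroom.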

Long-context serving won't scale until the storage layer stops pretending SSDs are files. This work is a clean proof that the bottleneck is software, not silicon.


Source: Unified KV Pooling to Accelerate Long-Context LLM Serving
Domain: arxiv.org
