Time-to-first-token (TTFT) for long-context LLM serving clocks in at 30.7 seconds, more than 3x the 10-second threshold most interactive applications tolerate. That's not a pipeline issue; it's a storage architecture problem.
Why 30.7 Seconds is Unacceptable
Running LLaMA 3.1-8B, GPT-OSS-20B, or Qwen3-30B-A3B with long contexts forces KV cache offload to host memory and SSDs, and current caching mechanisms weren't designed for contexts that spill past GPU memory. There are two root causes. First, retrieval is serialized across host memory and SSD: while one module is being read, the others sit idle. Second, and worse, SSD-based KV retrieval spends 84% of its time in the kernel filesystem rather than actually reading or writing data.
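To see why serialized retrieval hurts, here is a minimal back-of-the-envelope sketch. The `Device` class, the bandwidth figures, and the chunk sizes are all hypothetical and chosen only for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    bandwidth_gb_s: float  # hypothetical sustained read bandwidth, GB/s


def serialized_seconds(reads):
    """Devices are read one after another, so the per-device times add up."""
    return sum(size_gb / dev.bandwidth_gb_s for dev, size_gb in reads)


def overlapped_seconds(reads):
    """All devices are read at once, so the slowest read sets the total."""
    return max(size_gb / dev.bandwidth_gb_s for dev, size_gb in reads)


# Illustrative numbers only, not measurements from the paper.
host = Device("host-memory", 20.0)
ssd = Device("ssd", 5.0)
reads = [(host, 10.0), (ssd, 10.0)]  # (device, GB of KV cache stored on it)

print(serialized_seconds(reads))  # 0.5 s + 2.0 s
print(overlapped_seconds(reads))  # bounded by the 2.0 s SSD read
```

In this toy model the overlapped read finishes as soon as the slowest device does, while the serialized read pays for every device in turn. The gap widens as more modules join the pool.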
KV-Passthrough: Bypassing the Kernel Tax
The authors of arXiv paper 2606.14779 built a unified KV pooling layer that aggregates multiple host-memory modules and SSDs into a single logical pool. KV caches are placed onto whichever device offers the bandwidth they need. But the real trick is KV-passthrough: it bypasses the kernel filesystem entirely and talks directly to SSDs from user space via the Storage Performance Development Kit (SPDK). That single change cuts blocked I/O time by up to 23.2x.
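One way to picture bandwidth-aware placement in a pooled layer is simple bandwidth-proportional striping. The sketch below is an assumption for illustration, not necessarily the paper's placement policy: the function name `stripe_by_bandwidth`, the device names, and the relative bandwidth weights are all hypothetical.

```python
def stripe_by_bandwidth(total_bytes, bandwidths):
    """Split total_bytes across devices in proportion to their bandwidth.

    bandwidths maps a device name to an integer relative bandwidth
    (hypothetical weights, not measured values).
    """
    if not bandwidths:
        raise ValueError("the pool needs at least one device")
    total_bw = sum(bandwidths.values())
    # Integer shares, rounded down, proportional to each device's bandwidth.
    shares = {
        name: total_bytes * bw // total_bw for name, bw in bandwidths.items()
    }
    # Hand any rounding leftover to the fastest device.
    leftover = total_bytes - sum(shares.values())
    fastest = max(bandwidths, key=bandwidths.get)
    shares[fastest] += leftover
    return shares


# Hypothetical pool: two host-memory modules and one SSD.
pool = {"dram0": 20, "dram1": 20, "ssd0": 10}
placement = stripe_by_bandwidth(1000, pool)
print(placement)
```

Sizing each share by bandwidth means every device needs roughly the same time to return its part, so a parallel read is not held up by one overloaded device.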
4.1x Faster and Under 10 Seconds
Across evaluations on LLaMA 3.1-8B, GPT-OSS-20B, and Qwen3-30B-A3B, unified KV pooling cuts TTFT by roughly 4.1x compared to state-of-the-art techniques. Every model configuration lands under the 10-second mark. No exotic hardware, no model surgery - just a smarter storage abstraction and a kernel bypass that should have been standard years ago.
Long-context serving won't scale until the storage layer stops pretending SSDs are files. This work is a clean demonstration that the bottleneck is software, not silicon.
Source: Unified KV Pooling to Accelerate Long-Context LLM Serving
Domain: arxiv.org