
KV Cache Pooling Slashes Long-Context LLM Latency 4x

arxiv.org · @systems_wire · 2 hours ago · Systems Engineering · 1 comment

Unified KV pooling and a kernel-bypass trick cut time-to-first-token from 30.7s to under 10s across 8B-30B models.

Tags: unified kv pooling, kv cache, long context llm, spdk, llm serving, filesystem overhead

Time-to-first-token for long-context LLM serving clocks in at 30.7 seconds - more than 3x the 10-second threshold most interactive applications tolerate. That's not a pipeline issue; it's a storage architecture problem.

Why 30.7 Seconds Is Unacceptable

Running LLaMA 3.1-8B, GPT-OSS-20B, or Qwen3-30B-A3B with long contexts forces KV cache offload to host memory and SSDs, and current caching mechanisms weren't designed for contexts that spill past GPU memory. Two root causes stand out. First, retrieval is serialized across host memory and SSD, so while one device serves a request the other modules sit idle. Second, and worse, SSD-based KV retrieval spends 84% of its time in the kernel filesystem rather than actually reading or writing data.
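The cost of serialized retrieval can be illustrated with a back-of-envelope model. This is only a sketch: the device names, transfer sizes, and bandwidths below are made-up values for illustration, not measurements from the paper, and the model ignores queuing and per-request latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fetch:
    """One KV-cache transfer from a single device (all values hypothetical)."""
    device: str
    gigabytes: float
    gb_per_s: float

    @property
    def seconds(self) -> float:
        # Pure bandwidth model: transfer time = size / bandwidth.
        return self.gigabytes / self.gb_per_s


def serialized_time(fetches: list[Fetch]) -> float:
    """Devices are read one after another, so the transfer times add up."""
    return sum(f.seconds for f in fetches)


def overlapped_time(fetches: list[Fetch]) -> float:
    """Devices are read concurrently, so the slowest transfer dominates."""
    return max(f.seconds for f in fetches)


# Illustrative numbers only (assumed, not taken from the paper).
FETCHES = [
    Fetch("host-dram", gigabytes=8.0, gb_per_s=20.0),
    Fetch("ssd0", gigabytes=6.0, gb_per_s=3.0),
    Fetch("ssd1", gigabytes=6.0, gb_per_s=3.0),
]

if __name__ == "__main__":
    print(f"serialized: {serialized_time(FETCHES):.2f} s")
    print(f"overlapped: {overlapped_time(FETCHES):.2f} s")
```

In this toy setup the serialized path pays for every device in turn, while the overlapped path is bounded by the slowest single transfer. That gap is the idle time the article attributes to serialized retrieval.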

KV-Passthrough: Bypassing the Kernel Tax

The authors of paper 2606.14779 built a unified KV pooling layer that aggregates multiple host-memory modules and SSDs into a single logical pool and places each KV cache on whichever device offers the bandwidth it needs. The real trick, though, is KV-passthrough: it bypasses the kernel filesystem entirely and talks to SSDs directly from user space via the Storage Performance Development Kit (SPDK). That single change cuts blocked I/O time by up to 23.2x.
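One way to picture bandwidth-aware placement in a unified pool is a greedy scheduler that sends each KV block to the device that would finish it soonest. This is a hypothetical sketch, not the paper's placement policy: the function name `place_blocks`, the device names, and the bandwidth figures are all assumptions made for illustration.

```python
def place_blocks(block_sizes_gb, device_bw_gbps):
    """Greedy placement sketch (hypothetical policy, not the paper's).

    Each block goes to the device whose queue would finish it earliest,
    so faster devices naturally absorb more blocks.
    Returns (placement, makespan_seconds).
    """
    busy = {name: 0.0 for name in device_bw_gbps}       # queued seconds per device
    placement = {name: [] for name in device_bw_gbps}   # block indices per device

    # Place the largest blocks first; ties keep their original order.
    ordered = sorted(enumerate(block_sizes_gb), key=lambda pair: -pair[1])
    for idx, size in ordered:
        best = min(
            device_bw_gbps,
            key=lambda dev: busy[dev] + size / device_bw_gbps[dev],
        )
        busy[best] += size / device_bw_gbps[best]
        placement[best].append(idx)

    return placement, max(busy.values())


if __name__ == "__main__":
    # Assumed bandwidths in GB/s; ten 1 GB KV blocks.
    devices = {"dram0": 16.0, "ssd0": 4.0}
    blocks = [1.0] * 10
    placement, makespan = place_blocks(blocks, devices)
    print(placement)
    print(f"pool finishes in {makespan:.3f} s")
```

With these assumed numbers, spreading blocks across both devices finishes sooner than queuing everything on the fastest device alone, which is the intuition behind treating host memory and SSDs as one pool. The kernel-bypass half of the design depends on SPDK's user-space NVMe driver, a C library, and is not modeled here.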

4.1x Faster and Under 10 Seconds

Across evaluations on LLaMA 3.1-8B, GPT-OSS-20B, and Qwen3-30B-A3B, unified KV pooling cuts TTFT by roughly 4.1x compared to state-of-the-art techniques. Every model configuration lands under the 10-second mark. No exotic hardware, no model surgery - just a smarter storage abstraction and a kernel bypass that should have been standard years ago.
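As a quick sanity check on the headline numbers, and assuming the roughly 4.1x speedup applies directly to the 30.7-second baseline quoted earlier (the article does not state this per model), the arithmetic works out as:

```latex
\[
\text{TTFT}_{\text{pooled}} \approx \frac{30.7\ \text{s}}{4.1} \approx 7.5\ \text{s} < 10\ \text{s}
\]
```

That is consistent with the claim that every configuration lands under the 10-second mark, although individual models may sit above or below this average.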

Long-context serving won't scale until the storage layer stops pretending SSDs are files. This paper is a clean demonstration that the bottleneck is software, not silicon.

Source: Unified KV Pooling to Accelerate Long-Context LLM Serving
Domain: arxiv.org
