KV Cache Pooling Slashes Long-Context LLM TTFT 4x

arxiv.org · @systems_wire · 2 days ago · Systems Engineering · 1 comment

Unified KV pooling and kernel bypass cut time-to-first-token from 30.7 seconds to under 10 seconds on 8B-30B models.

unified kv pooling · kv cache · long context llm · spdk · llm serving · filesystem overhead

Time-to-first-token for long-context LLM serving clocks in at 30.7 seconds - more than 3x the 10-second threshold most interactive applications tolerate. That's not a pipeline issue; it's a storage architecture problem.

Why 30.7 Seconds is Unacceptable

Running LLaMA 3.1-8B, GPT-OSS-20B, or Qwen3-30B-A3B with long contexts forces KV cache offload to host memory and SSDs, and current caching mechanisms weren't designed for contexts that spill past GPU memory. There are two root causes. First, retrieval is serialized across host memory and SSD, so while one module is being read the others sit idle. Second, and worse, SSD-based KV retrieval spends 84% of its time in the kernel filesystem rather than actually reading or writing data.
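To make the two root causes concrete, here is a toy Python model. It is a sketch under stated assumptions, not the paper's measurement setup: the tier sizes and bandwidths are invented for illustration, and the function names are hypothetical. It compares reading tiers one after another against reading them concurrently, and computes the Amdahl-style ceiling on retrieval speedup if the 84% filesystem share disappeared entirely.

```python
# Toy model of KV retrieval across storage tiers.
# All sizes and bandwidths below are hypothetical, chosen only for illustration.

def serialized_time(tiers):
    """Tiers are read one after another, so the per-tier times add up."""
    return sum(size_gb / bw_gbps for size_gb, bw_gbps in tiers)


def overlapped_time(tiers):
    """Tiers are read concurrently, so the slowest tier sets the total time."""
    return max(size_gb / bw_gbps for size_gb, bw_gbps in tiers)


def max_speedup_if_removed(overhead_fraction):
    """Amdahl-style bound: best-case speedup if the overhead vanished entirely."""
    return 1.0 / (1.0 - overhead_fraction)


# (GB to fetch, GB/s) per tier: one host-memory module and two SSDs (made up).
tiers = [(10.0, 20.0), (6.0, 3.0), (6.0, 3.0)]

print(f"serialized: {serialized_time(tiers):.2f} s")
print(f"overlapped: {overlapped_time(tiers):.2f} s")
print(f"ceiling from removing an 84% overhead: {max_speedup_if_removed(0.84):.2f}x")
```

The 6.25x ceiling applies to end-to-end SSD retrieval time in this simple model. The larger blocked-I/O figure the article reports is a different metric, so the two numbers are not directly comparable.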

KV-Passthrough: Bypassing the Kernel Tax

The authors - paper 2606.14779 - built a unified KV pooling layer that aggregates multiple host-memory modules and SSDs into a single logical pool. KV caches are placed onto whichever device offers the bandwidth they need. But the real trick is KV-passthrough: it bypasses the kernel filesystem entirely and talks directly to SSDs from user space via the Storage Performance Development Kit (SPDK). That single change slashes blocked I/O time by up to 23.2x.
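The idea of placing KV caches by bandwidth can be sketched with a simple greedy policy. This is an illustrative assumption, not the authors' algorithm: the `Device` class, the `place_blocks` function, and every capacity and bandwidth figure are hypothetical. Each block goes to the device in the pool that would finish reading its contents soonest, subject to capacity.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    bandwidth_gbps: float   # sustained read bandwidth (hypothetical value)
    capacity_gb: float
    used_gb: float = 0.0

    def finish_time_with(self, extra_gb: float) -> float:
        # Time to read everything on this device if extra_gb were added to it.
        return (self.used_gb + extra_gb) / self.bandwidth_gbps


def place_blocks(block_sizes_gb, devices):
    """Greedily assign each block to the device that would finish reading soonest."""
    placement = {}
    for idx, size in enumerate(block_sizes_gb):
        candidates = [d for d in devices if d.used_gb + size <= d.capacity_gb]
        if not candidates:
            raise RuntimeError(f"pool is full, cannot place block {idx}")
        best = min(candidates, key=lambda d: d.finish_time_with(size))
        best.used_gb += size
        placement[idx] = best.name
    return placement


# One logical pool built from a host-memory module and two SSDs (made-up specs).
pool = [
    Device("dram0", bandwidth_gbps=20.0, capacity_gb=4.0),
    Device("ssd0", bandwidth_gbps=3.0, capacity_gb=64.0),
    Device("ssd1", bandwidth_gbps=3.0, capacity_gb=64.0),
]
placement = place_blocks([1.0] * 8, pool)
for dev in pool:
    print(dev.name, dev.used_gb)
```

In this run the fast but small host-memory module fills first, and the remaining blocks are split evenly across the two SSDs so that both can be read in parallel. A real system would also have to account for eviction, block reuse, and contention, which this sketch ignores.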

4.1x Faster and Under 10 Seconds

Across evaluations on LLaMA 3.1-8B, GPT-OSS-20B, and Qwen3-30B-A3B, unified KV pooling cuts TTFT by roughly 4.1x compared to state-of-the-art techniques. Every model configuration lands under the 10-second mark. No exotic hardware, no model surgery - just a smarter storage abstraction and a kernel bypass that should have been standard years ago.
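A quick arithmetic check on the headline numbers, assuming the roughly 4.1x speedup applies to the 30.7-second baseline quoted earlier (per-model figures in the paper may differ):

```python
baseline_ttft_s = 30.7   # baseline TTFT quoted in the article
speedup = 4.1            # "roughly 4.1x", per the article
threshold_s = 10.0       # interactive threshold quoted in the article

implied_ttft_s = baseline_ttft_s / speedup
required_speedup = baseline_ttft_s / threshold_s

print(f"implied TTFT: {implied_ttft_s:.1f} s")
print(f"speedup needed to reach {threshold_s:.0f} s: {required_speedup:.2f}x")
```

Under that assumption the implied TTFT is about 7.5 seconds, and any speedup above about 3.07x would be enough to clear the 10-second mark.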

Long-context serving won't scale until the storage layer stops pretending SSDs are files. This work is a clean demonstration that the bottleneck is software, not silicon.

Source: Unified KV Pooling to Accelerate Long-Context LLM Serving
Domain: arxiv.org
