Hypic delivers 2.45x faster time-to-first-token (TTFT) on average, and it does so by resolving a fundamental algebraic mismatch between position-independent caching (PIC) and hybrid-attention models. Previous PIC systems worked for full-attention transformers, but the linear-attention layers in hybrid models keep a per-request recurrent state that cannot be decomposed into per-token KV caches. Hypic, a new serving system from an unannounced academic team, is the first to fix that.
The Algebraic Blind Spot That Broke Hybrid-Attention Caching
Linear-attention layers don't expose per-token hidden states; they carry a single recurrent state that summarizes everything seen so far. Existing caching primitives, whether prefix caching or PIC-style reuse of non-contiguous KV spans, rely on per-token operations. Hypic identifies the missing algebraic primitive: the segment-cumulative transition operator. By caching this operator alongside each segment's zero-start end-state, Hypic can compose independently cached segments near-exactly and in constant time. The linear-attention layers need no recomputation, and the composition itself introduces almost no error.
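To make the composition rule concrete, here is a minimal sketch in Python. It assumes a gated linear-attention recurrence of the form S_t = diag(a_t) S_{t-1} + k_t v_t^T, which is a common formulation but not necessarily the one Hypic targets. The function names (cache_segment, compose) and the toy dimensions are illustrative and do not come from the paper. Under that assumption, the segment-cumulative transition operator is the elementwise product of the per-token decays, and a cached segment can be applied to any incoming state at a cost that does not depend on the segment's length.

```python
import random

def zeros(d_k, d_v):
    return [[0.0] * d_v for _ in range(d_k)]

def run(tokens, state):
    """Token-by-token recurrence: S[i][j] <- a[i] * S[i][j] + k[i] * v[j]."""
    for a, k, v in tokens:
        state = [[a[i] * state[i][j] + k[i] * v[j] for j in range(len(v))]
                 for i in range(len(k))]
    return state

def cache_segment(tokens, d_k, d_v):
    """Hypothetical cache entry: (cumulative decay P, zero-start end-state Z)."""
    P = [1.0] * d_k
    for a, _, _ in tokens:
        P = [p * a_i for p, a_i in zip(P, a)]
    Z = run(tokens, zeros(d_k, d_v))
    return P, Z

def compose(prev_state, entry):
    """Apply a cached segment to an incoming state: S_out = diag(P) S_in + Z."""
    P, Z = entry
    return [[P[i] * prev_state[i][j] + Z[i][j] for j in range(len(Z[0]))]
            for i in range(len(Z))]

def random_tokens(n, d_k, d_v, rng):
    # Each token is (decay vector a, key k, value v); values are synthetic.
    return [([rng.uniform(0.8, 1.0) for _ in range(d_k)],
             [rng.gauss(0, 1) for _ in range(d_k)],
             [rng.gauss(0, 1) for _ in range(d_v)]) for _ in range(n)]

def max_abs_diff(x, y):
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

rng = random.Random(0)
D_K, D_V = 4, 3
seg_a = random_tokens(5, D_K, D_V, rng)
seg_b = random_tokens(7, D_K, D_V, rng)
entry_a = cache_segment(seg_a, D_K, D_V)
entry_b = cache_segment(seg_b, D_K, D_V)

# Order A then B: cached composition vs. full recompute.
reference = run(seg_a + seg_b, zeros(D_K, D_V))
composed = compose(compose(zeros(D_K, D_V), entry_a), entry_b)
err = max_abs_diff(reference, composed)

# Order B then A reuses the same cache entries, which is the point of PIC.
swapped_reference = run(seg_b + seg_a, zeros(D_K, D_V))
swapped = compose(compose(zeros(D_K, D_V), entry_b), entry_a)
swapped_err = max_abs_diff(swapped_reference, swapped)
```

The same two cache entries serve both orderings, and the only difference from a full recompute is floating-point rounding. That is one plausible reading of "near-exact" in this toy setting.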
Full-attention layers in hybrid models still matter, and they break PIC in a different way. Without per-token hidden states from the linear layers, selective recomputation of attention tokens (the standard band-aid) is impossible. Hypic's insight: the largest attention deviations concentrate at segment boundaries. Recomputing a small seam window, just the boundary tokens, restores cross-segment lookback with negligible overhead.
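As a rough illustration of how small a seam window is relative to the whole prompt, the sketch below picks the token positions that would be recomputed. The article does not say where the window sits or how wide it is, so this assumes it covers the first `window` tokens of every segment after the first; the helper name seam_positions and the example lengths are hypothetical.

```python
def seam_positions(segment_lengths, window):
    """Absolute positions of the first `window` tokens of each non-first segment."""
    positions = []
    start = 0
    for index, length in enumerate(segment_lengths):
        if index > 0:
            # Clamp the window so it never runs past a short segment.
            positions.extend(range(start, start + min(window, length)))
        start += length
    return positions

lengths = [6, 4, 5]          # three cached segments, 15 tokens in total
seams = seam_positions(lengths, window=2)
fraction = len(seams) / sum(lengths)
```

With realistic segment lengths in the hundreds or thousands of tokens and a window of a few tokens, the recomputed fraction shrinks accordingly.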
2.45x Lower TTFT and Up to 2.0x Throughput at Near-Full-Recompute Accuracy
Hypic exploits segment-level self-containment to parallelize cache-miss prefill across instances. This turns cold requests, previously a major tail-latency contributor under both prefix caching and prior PIC, into a workload that can be accelerated. Evaluated across four hybrid-attention models (the abstract does not name them) and five diverse workloads, Hypic reduces TTFT by 2.45x on average and raises peak throughput by up to 2.0x compared to existing serving systems. Accuracy stays within 3.3 points of full recompute, so the caching approximations cost little on RAG and agentic tasks, although they are not strictly lossless.
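The parallelism follows from the fact that each segment's summary depends only on that segment. A toy sketch, assuming a scalar recurrence s_t = a_t * s_{t-1} + b_t and using threads as a stand-in for separate serving instances (the names summarize, merge, and parallel_prefill are hypothetical, and the paper's actual scheduling is not described in the article):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def summarize(segment):
    """Prefill one segment in isolation: return (P, Z) from a zero start."""
    p, z = 1.0, 0.0
    for a, b in segment:
        p, z = p * a, a * z + b
    return p, z

def merge(left, right):
    """Summary of `left` followed by `right`."""
    p_l, z_l = left
    p_r, z_r = right
    return p_r * p_l, p_r * z_l + z_r

def parallel_prefill(segments, workers=4):
    # Map step: segments are independent, so they can be prefilled concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(summarize, segments))
    # Reduce step: a cheap in-order fold, starting from the identity (1, 0).
    return reduce(merge, summaries, (1.0, 0.0))

segments = [[(0.9, 1.0), (0.8, 2.0)], [(0.5, 3.0)], [(0.7, 0.5), (0.6, 1.5)]]
_, parallel_state = parallel_prefill(segments)

# Sequential reference over the concatenated tokens.
sequential_state = 0.0
for a, b in (token for segment in segments for token in segment):
    sequential_state = a * sequential_state + b
```

The expensive per-token work happens in the map step, while the fold touches only one small summary per segment.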
The abstraction is clean: linear layers get a new primitive, full-attention layers get a boundary trick, and the entire pipeline gets instance-level parallelism. Hypic's design directly attacks the prefill-dominant cost model that defines modern RAG and agentic LLM serving—where prompts are assembled from independent segments into long contexts. Cold requests become warm as soon as you've cached the segment operators.
What This Enables Next
Hypic doesn't just accelerate existing hybrid-attention models; it reopens the design space for serving systems. If the segment-cumulative transition operator becomes a standard cache primitive, model architects can lean harder on linear attention without worrying about serving overhead. Expect follow-up work on adaptive seam-window sizes and operator caching hierarchies. For now, the reported 2.45x average reduction means time-to-first-token falls by more than half, a concrete win for anyone running hybrid-attention models at scale.
Source: HYPIC: Accelerating Hybrid-Attention LLM Serving with Position-Independent Caching
Domain: arxiv.org