Source linked

UltraQuant Cuts KV Cache Latency 3.47x on Agent Workloads Using 4-Bit FP4 Path

A new 4-bit KV-cache compression scheme for context-heavy agents reduces P50 time-to-first-token by 3.47x in late rounds and boosts throughput 1.63x over FP8 baselines on AMD CDNA4 hardware.

ultraquantkv caching4 bitamdcdna4large language models

UltraQuant slashes P50 time-to-first-token by 3.47x in cache-pressured late rounds on a multi-turn agent workload, compared to an FP8 KV baseline.

Context-heavy agents create a lopsided problem for KV caching. Long prefixes repeat across many short turns, and concurrency determines whether your GPUs stay busy. Standard FP8 caching works, but it leaves throughput on the table when the cache grows fat. UltraQuant goes to 4-bit without the usual quality cliff.

Why Agent Workloads Break Existing KV Caches

Multi-round agent tasks stress the cache differently than chat or single-turn generation. The paper frames the problem around three metrics that must be measured jointly: task quality, cache residency, and serving throughput. Most KV-cache compression work ignores the interplay between these for long-context, short-turn loops.

TurboQuant-style rotation and codebook quantization serve as the quality anchor. vLLM FP8 KV caching is the deployment anchor. UltraQuant beats both on the agentic workload by optimizing for the specific pattern of cache churn.

Four Bits Without the Quality Cliff

The paper describes practical choices that make 4-bit robust. Asymmetric K/V treatment matters because keys and values have different statistical properties. Walsh-Hadamard rotation replaces more expensive orthogonal transforms. QJL removal cuts unnecessary overhead. Block-scale variants give fine-grained control over quantization error.

These aren't theoretical tweaks. They are design decisions tuned on real agent traces. The result is a 4-bit path that preserves task quality while cutting memory footprint by half relative to FP8.

AMD GPUs Get a 4-Bit Upgrade

UltraQuant includes serving optimizations specific to AMD GPUs with CDNA4 architecture. The FP4 approximation path uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support. Optimized decode-attention kernels keep the GPU fed even when the cache is under pressure.

Across all rounds of the agentic workload, P50 TTFT drops 2.3x. Output throughput climbs 1.63x over the FP8 KV baseline. Those numbers come from real hardware, not a simulator.

UltraQuant suggests that 4-bit KV caching is not a compromise but a net win for the growing class of context-heavy agents, especially as hardware like CDNA4 provides native support for these lower-precision paths.


Source: UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.