Source linked

Selling KV Caches Cuts AI Prefill Costs 50x

Loading a precomputed KV cache matches prefill token-for-token at 9-50x lower compute, but shipping the cache over the network costs more than it saves; provider-side hosting unlocks the real margin.

luoyuan zhangkv cacheqwen3 4bai agentsinference optimizationprefill caching

80 million agents re-prefilling a single 3774-token document costs $1.5M in compute. Reusing a cached key-value (KV) input costs $0.03M. That's a 49.7x gap, and Luoyuan Zhang's paper is the first to ask the obvious question: why isn't there a market for KV caches?

Every Agent Rebuilds the Same Castle

Every AI agent reading a document runs the prefill pass—the most compute-intensive step in a large model—over the exact same text as every other agent. Zhang calls it an "absurd act" and backs it up with numbers: on Qwen3-4B, reuse is 9–50x cheaper than computing from scratch, and the gap grows with document length because prefill attention scales as $L^2$. The compute spent on repeated prefill across millions of agents is a staggering waste of GPU cycles.

Token-Exact Reuse, No Accuracy Cost

The proposal is almost offensively simple: let a publisher precompute a document's KV cache once, then let any agent buy the right to load it and continue from there. Zhang verifies token-exact match: 24/24 greedy tokens identical, and logit-level agreement, with zero accuracy penalty. On Qwen3-4B, a single reuse already pays back the precomputation cost, and every subsequent reuse is pure savings.

The Egress Trap

Here's where the idea hits reality. Shipping the KV cache to a client fails because KV is nearly incompressible. Per-load egress costs more than the prefill you saved. The fix is provider-side hosting—exactly how production prompt-caching works—which eliminates egress entirely. Keep the cache inside the serving provider's network, and the savings survive.

The Size of the Prize

Zhang runs the numbers: one hot 3774-token document served to 80M agents. Re-prefill: ~$1.5M. Reuse compute: ~$0.03M. The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope. The 10x is a floor; the measured ~50x compute saving clears it easily. The gap to the physical ~50x is provider margin—millions of dollars per popular document. The paper frames the resulting agent-native prefill CDN and leaves lossless KV compression and a cross-party payment layer as the open problems. If anyone figures out those two, the economic case for a KV cache marketplace becomes irresistible.


Source: Can I Buy Your KV Cache?
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.