
P2P Prefix-Cache Routing Cuts LLM Inference Latency Without Central Coordination

A decentralized routing scheme uses local radix trees and periodic anti-entropy to match prefix caches across LLM serving nodes, avoiding KV-cache transfers and centralized coordination.

arxiv, llm inference, prefix caching, peer to peer, distributed systems, kv cache

Prefix caching inside a single node is simple; making it work across a cluster without shuffling KV-caches around is the hard part. This paper skips the centralized shuffle entirely.

Radix Trees and Anti-Entropy Keep Routing Decentralized

Each LLM serving node builds a local radix tree of its own cached prefixes. Periodically, nodes exchange metadata about what they hold through anti-entropy gossip. No central scheduler decides where a request goes. Instead, the node that receives a query checks its local tree and, if the match isn't long enough, forwards the request to the peer with the longest estimated prefix match. The key insight: stale metadata costs only a cache miss; it never produces wrong outputs. Weak consistency is enough for correctness.
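The routing rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: it uses an uncompressed token trie in place of a true radix tree (same longest-prefix answer, less compact), and the names `PrefixIndex`, `route`, `peer_views`, and `min_local_match` are hypothetical. `peer_views` stands for whatever estimate of each peer's cached prefixes a node has assembled from gossip.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps the next token to the child node for that token.
    children: dict = field(default_factory=dict)


class PrefixIndex:
    """Uncompressed token trie, standing in for a radix tree (assumption)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_len(self, tokens):
        """Length of the longest indexed prefix of `tokens`."""
        node, depth = self.root, 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            depth += 1
        return depth


def route(self_id, local, peer_views, tokens, min_local_match):
    """Pick a node for the request; returns (node_id, estimated_match_len).

    `min_local_match` is a hypothetical threshold for "long enough".
    """
    local_len = local.match_len(tokens)
    if local_len >= min_local_match:
        return self_id, local_len
    best_id, best_len = self_id, local_len
    for peer_id in sorted(peer_views):  # sorted for deterministic tie-breaks
        est = peer_views[peer_id].match_len(tokens)
        if est > best_len:
            best_id, best_len = peer_id, est
    return best_id, best_len
```

If no peer is estimated to hold a longer prefix than the local node, the request stays local, so the worst case is simply an ordinary prefill on the receiving node.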

MMLU Workloads Show Gains Under Low Latency, Skewed Prefixes

The evaluation uses simulated MMLU workloads. When communication delay between nodes is low and prefix distributions are skewed (that is, many queries share the same prompt prefixes), decentralized routing cuts latency. The local radix tree gives fast matches, and the gossip-based estimates let requests land on nodes that already have the relevant KV-cache materialized. But the scheme has limits. High network latency between peers erodes the advantage because the routing decision itself takes time. And affinity-induced hotspots, where many requests pile onto the same node, can degrade performance.
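The latency tradeoff can be made concrete with a back-of-the-envelope break-even check. This is not a model from the paper: it assumes prefill cost grows linearly with the number of uncached tokens (a simplification) and ignores queueing on the target node, so it says nothing about hotspots. The function name and all parameters are hypothetical.

```python
def forwarding_saves_time(local_match, peer_match, prefill_ms_per_token, hop_ms):
    """Estimate whether forwarding beats serving locally.

    Assumes (simplification) that prefill time is linear in uncached tokens,
    so the saving is the extra tokens the peer already has cached.
    """
    extra_tokens_cached = max(peer_match - local_match, 0)
    saved_ms = extra_tokens_cached * prefill_ms_per_token
    return saved_ms > hop_ms
```

Under these assumptions, a peer holding 500 more cached tokens at 0.2 ms per token saves roughly 100 ms of prefill, which justifies a 20 ms hop but not a 150 ms one.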

Weak Consistency Is a Feature, Not a Bug

The architecture intentionally avoids KV-cache transfer. Moving a cache from one node to another in a P2P network would add bandwidth and latency costs that defeat the purpose. Instead, the system accepts that a node's view of peer caches may be minutes out of date. A request might get routed to a node that has since evicted the prefix, forcing a recompute from scratch. That's a cache miss, not a correctness failure. The paper argues that this tradeoff makes the system practical for peer-to-peer deployments where central coordination is infeasible.
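A toy sketch of why staleness is safe: the serving node always checks its real cache and recomputes on a miss, so the routing metadata affects only cost. Everything here is an assumption made for illustration, including the version-numbered summaries, the `merge_views` rule, and the `prefill` and `decode` stand-ins; the paper's actual metadata format and anti-entropy protocol are not reproduced.

```python
def prefill(prefix):
    """Stand-in for building KV state from a prefix (the expensive step)."""
    return tuple(prefix)


def decode(kv_state, suffix):
    """Stand-in for generation; depends only on the full token sequence."""
    return sum(kv_state) + sum(suffix)


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.version = 0   # bumped whenever the cache contents change
        self.cache = {}    # prefix tuple -> KV state stand-in
        self.misses = 0

    def admit(self, prefix):
        self.cache[tuple(prefix)] = prefill(prefix)
        self.version += 1

    def evict(self, prefix):
        self.cache.pop(tuple(prefix), None)
        self.version += 1

    def summary(self):
        """Metadata a node would gossip: a version and the prefixes it holds."""
        return self.version, frozenset(self.cache)

    def serve(self, tokens, prefix_len):
        prefix = tuple(tokens[:prefix_len])
        kv = self.cache.get(prefix)
        if kv is None:
            # A stale route landed here: recompute instead of failing.
            self.misses += 1
            kv = prefill(prefix)
            self.cache[prefix] = kv
            self.version += 1
        return decode(kv, tokens[prefix_len:])


def merge_views(mine, theirs):
    """Anti-entropy merge: per peer, keep the summary with the higher version."""
    merged = dict(mine)
    for peer_id, (version, prefixes) in theirs.items():
        if peer_id not in merged or version > merged[peer_id][0]:
            merged[peer_id] = (version, prefixes)
    return merged
```

A request served from a warm cache and the same request served after an eviction return the same result; only the miss counter differs. The next `merge_views` exchange then replaces the stale summary with the newer one.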

What this enables next: a path toward serving LLMs across loosely federated consumer hardware without dedicated cluster management. If the technique holds up under real network conditions, we might see inference-as-a-service built on P2P overlays rather than datacenters.


Source: Towards Distributed Inference of LLMs on a P2P Network
Domain: arxiv.org
