
P2P Prefix-Cache Routing Cuts LLM Inference Latency Without Central Coordination

The routing scheme uses local radix trees and anti-entropy gossip to match prefixes across LLM serving nodes, avoiding KV-cache transfer and central coordination.

arxiv, llm inference, prefix caching, peer to peer, distributed systems, kv cache

Prefix caching inside a single node is simple; making it work across a cluster without shuffling KV-caches around is the hard part. This paper skips the centralized shuffle entirely.

Radix Trees and Anti-Entropy Keep Routing Decentralized

Each LLM serving node builds a local radix tree of its own cached prefixes. Periodically, nodes exchange metadata about what they hold using anti-entropy gossip. No central scheduler decides where a request goes. Instead, the node that receives a query checks its local tree and, if the match isn't long enough, forwards the request to the peer with the longest estimated prefix match. The key insight: stale metadata only costs a cache miss - it never produces wrong outputs. Weak consistency is enough for correctness.
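The routing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `PrefixIndex` is a plain token-level trie (an uncompressed radix tree), and `route`, `peer_views`, and `min_local_match` are hypothetical names. It assumes each peer's gossiped summary can be queried for an estimated match length in the same way as the local index.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)


class PrefixIndex:
    """Token-level trie over cached prefixes (an uncompressed radix tree)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record that the KV-cache for this token sequence is held."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_len(self, tokens):
        """Number of leading tokens of `tokens` that are already cached."""
        node, n = self.root, 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            n += 1
        return n


def route(self_id, local, peer_views, tokens, min_local_match):
    """Return the id of the node that should serve `tokens`.

    `peer_views` maps peer id -> PrefixIndex built from gossiped metadata
    (assumed shape; it may be stale, which only risks a cache miss).
    """
    local_len = local.match_len(tokens)
    if local_len >= min_local_match:
        return self_id  # local match is long enough, serve here
    best_id, best_len = self_id, local_len
    for peer_id, view in peer_views.items():
        est = view.match_len(tokens)
        if est > best_len:
            best_id, best_len = peer_id, est
    return best_id  # stays self_id if no peer is estimated to do better
```

If no peer is estimated to hold a longer prefix than the receiving node, the sketch keeps the request local rather than paying for a pointless hop.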

MMLU Workloads Show Gains Under Low Latency, Skewed Prefixes

Evaluation on simulated MMLU workloads tells a clear story. When communication delay between nodes is low and prefix distributions are skewed - meaning many queries share the same prompt prefixes - decentralized routing cuts latency. The local radix tree gives fast matches, and the gossip-based estimates let requests land on nodes that already have the relevant KV-cache materialized. But the scheme has limits. High network latency between peers erodes the latency advantage because the routing decision itself takes time. And affinity-induced hotspots - where many requests pile onto the same node - can degrade performance.
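A rough way to see why both limits matter is a break-even estimate. The model below is an illustrative assumption, not taken from the paper: it treats prefill cost as linear in the number of tokens that must be recomputed, charges one network round trip for forwarding, and folds hotspot pressure into a hypothetical `peer_queue_ms` term.

```python
def forwarding_gain_ms(local_match, peer_match, prefill_ms_per_token,
                       rtt_ms, peer_queue_ms=0.0):
    """Estimated latency saved by forwarding; negative means serve locally.

    All parameters are illustrative assumptions:
    - local_match / peer_match: cached prefix lengths in tokens
    - prefill_ms_per_token: assumed linear prefill cost
    - rtt_ms: extra network round trip paid when forwarding
    - peer_queue_ms: extra wait at a busy (hotspot) peer
    """
    tokens_saved = max(peer_match - local_match, 0)
    return tokens_saved * prefill_ms_per_token - rtt_ms - peer_queue_ms
```

Under these assumed numbers, a 400-token prefix held by a peer at 0.5 ms per token is worth 200 ms of prefill: forwarding pays off with a 20 ms round trip, but not with a 250 ms round trip or with a 300 ms queue at the peer.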

Weak Consistency Is a Feature, Not a Bug

The architecture intentionally avoids KV-cache transfer. Moving a cache from one node to another in a P2P network would add bandwidth cost and latency that defeat the purpose. Instead, the system accepts that a node's view of peer caches may be minutes out of date. A request might get routed to a node that has since evicted the prefix, forcing a recompute from scratch. That's a cache miss, not a correctness failure. The paper argues that this tradeoff makes the system practical for peer-to-peer deployments where central coordination is infeasible.
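A small sketch makes the "miss, not failure" argument concrete. It assumes, hypothetically, that each node's gossiped summary carries a version number and that anti-entropy keeps the newer summary per node; the article does not specify the paper's metadata format. `merge_views` and `serve` are illustrative names, and `cache` is modeled as a set of cached prefixes stored as token tuples.

```python
def merge_views(mine, theirs):
    """Anti-entropy step: per node, keep the summary with the higher version.

    Views map node id -> (version, frozenset of cached prefixes); this
    versioned shape is an assumption made for the sketch.
    """
    merged = dict(mine)
    for node_id, (version, prefixes) in theirs.items():
        if node_id not in merged or version > merged[node_id][0]:
            merged[node_id] = (version, prefixes)
    return merged


def serve(cache, tokens):
    """Return (tokens_reused, tokens_recomputed) for a request.

    A stale routing decision only shifts work from reuse to recompute;
    the generated output does not depend on whether the cache was hit.
    """
    reused = 0
    for n in range(len(tokens), 0, -1):  # longest cached prefix first
        if tuple(tokens[:n]) in cache:
            reused = n
            break
    return reused, len(tokens) - reused
```

If a peer evicted the prefix after its last gossip round, `serve` simply reports zero reused tokens and recomputes the full prompt, which is the cache-miss cost described above.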

What this enables next: a path toward serving LLMs across loosely federated consumer hardware without dedicated cluster management. If the technique holds up under real network conditions, we might see inference-as-a-service built on P2P overlays rather than datacenters.


Source: Towards Distributed Inference of LLMs on a P2P Network
Domain: arxiv.org
