Aging-Based Scheduler Cuts LLM Serving Latency by 10%

A new framework combines dynamic priority aging with predictive latency control to cut mean end-to-end latency by more than 10% on NVIDIA and Ascend hardware, addressing head-of-line blocking and request starvation in...

large language models, llm serving, chunked prefill, scheduling, nvidia, ascend

Over 10% mean end-to-end latency drop on real NVIDIA and Ascend hardware, achieved by replacing First-Come-First-Served with an aging-based scheduler that actually cares about fairness. That's the headline from a new arXiv paper tackling the dirty secret of chunked-prefill LLM serving: head-of-line blocking and request starvation when workloads go heterogeneous.

Why Static Token Budgets Fail Under Heterogeneous Load

Existing chunked-prefill engines use rigid FCFS policies and fixed token budgets. Works fine for uniform traffic; falls apart when a single huge prefill request arrives. That request blocks everything behind it — clients time out, smaller requests starve. The paper's authors call this out directly: fairness degrades, latency jitter becomes unpredictable. I've seen this pattern in production logs; it's the kind of problem that makes engineers reach for hacks like manual request reordering.
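To make the blocking effect concrete, here is a minimal toy model, not code from the paper or from any real serving engine. It assumes each scheduling step has a fixed token budget that is handed out strictly in arrival order; the function name `fcfs_finish_steps` and the request sizes are made up for illustration.

```python
from collections import deque


def fcfs_finish_steps(prefill_lengths, token_budget):
    """Toy model: return the step at which each request's prefill completes
    when a fixed per-step token budget is handed out in strict FCFS order."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    remaining = list(prefill_lengths)
    finish = [None] * len(prefill_lengths)
    queue = deque(range(len(prefill_lengths)))  # request indices in arrival order
    step = 0
    while queue:
        step += 1
        budget = token_budget
        # Serve the head of the queue first; later requests only get
        # whatever budget is left over in this step.
        while queue and budget > 0:
            idx = queue[0]
            take = min(remaining[idx], budget)
            remaining[idx] -= take
            budget -= take
            if remaining[idx] == 0:
                finish[idx] = step
                queue.popleft()
    return finish


# One 8192-token prefill ahead of two 64-token requests, 512 tokens per step.
blocked = fcfs_finish_steps([8192, 64, 64], token_budget=512)
# The same requests with the large one arriving last.
unblocked = fcfs_finish_steps([64, 64, 8192], token_budget=512)
print(blocked, unblocked)
```

In this toy setup the two small requests finish at step 17 when they queue behind the large prefill, and at step 1 when they arrive ahead of it, even though the total work is identical.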

Three Levers: Aging, Prediction, Active Control

The framework deploys three coordinated mechanisms. First, a lightweight aging-based scheduling policy dynamically computes priorities from accumulated waiting time and remaining prefill work. There are no static thresholds: a request's priority rises naturally the longer it waits, so a large request cannot be pushed back indefinitely. Second, Latency-Prediction-Based Request Scheduling (LPRS) replaces fixed token budgets with target-time constraints. The scheduler predicts how long a chunked prefill will take and adjusts scheduling to meet a latency SLA rather than hard-limiting tokens. Third, Active Prefill Control (APC) regulates prefill concurrency at the engine level, suppressing the fragmentation that arises from naive chunking.
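The first two mechanisms can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulas: the linear priority score, the weights `wait_weight` and `work_weight`, and the linear latency model (`fixed_overhead_ms` plus `ms_per_token` per token) are all hypothetical stand-ins chosen to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival_s: float
    remaining_prefill_tokens: int


def aging_priority(req, now_s, wait_weight=1.0, work_weight=0.001):
    """Hypothetical score: grows with waiting time, shrinks with remaining
    prefill work. The linear form and both weights are assumptions."""
    waited_s = max(0.0, now_s - req.arrival_s)
    return wait_weight * waited_s - work_weight * req.remaining_prefill_tokens


def pick_next(requests, now_s):
    """Select the request with the highest aging priority at time now_s."""
    return max(requests, key=lambda r: aging_priority(r, now_s))


def tokens_within_target(target_ms, fixed_overhead_ms, ms_per_token):
    """Largest chunk whose predicted latency fits the time target, assuming
    predicted latency = fixed_overhead_ms + ms_per_token * tokens."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    usable_ms = target_ms - fixed_overhead_ms
    return max(0, int(usable_ms // ms_per_token))


big = Request("big", arrival_s=0.0, remaining_prefill_tokens=8192)
small = Request("small", arrival_s=1.0, remaining_prefill_tokens=64)
late_small = Request("late_small", arrival_s=19.5, remaining_prefill_tokens=64)

early_choice = pick_next([big, small], now_s=2.0).name        # short job goes first
later_choice = pick_next([big, late_small], now_s=20.0).name  # aged big job wins
chunk = tokens_within_target(target_ms=50.0, fixed_overhead_ms=10.0, ms_per_token=0.25)
print(early_choice, later_choice, chunk)
```

With these made-up weights, the short request is served first while the large one is still fresh, but once the large request has waited long enough it outranks newly arrived short requests. The chunk size is derived from a time target instead of being a fixed token count, which is the substitution LPRS is described as making.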

Real Hardware, Real Numbers

Evaluated on both NVIDIA GPUs and Ascend accelerators with real-world workload traces, the aging policy alone reduces mean end-to-end latency by over 10% compared to FCFS. More important: LPRS and APC together cut P99 tail latency significantly and suppress prefill fragmentation. The paper explicitly states that structural prefill control and temporal latency constraints are fundamentally complementary — meaning you need both, not one or the other. All code is on GitHub, so you can reproduce the results instead of taking their word.

Expect this to land in production engines within a year. The problem is too painful to ignore, and the solution is simple enough to deploy without rewriting the entire serving stack.


Source: Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving
Domain: arxiv.org
