A 92% reduction in time-to-first-token (TTFT) and a 51% reduction in time-per-output-token (TPOT), with no hyperparameters to tune: those are the headline numbers reported for LMETRIC, an LLM request scheduler, each measured against a different baseline. The trick is embarrassingly simple: multiply two numbers.
Why Hyperparameter Tuning Is Dead
LLM schedulers have to balance two conflicting objectives: route a request to an instance that already has relevant KVCache (to avoid recomputing prefill tokens) and keep load balanced across instances. Current approaches combine the two signals with a weighted sum or another linear combination, then tune the weights for each workload. That means either expensive workload-specific tuning or building a model-hardware simulator. The authors show those hyperparameters are redundant. When you multiply a KVCache-aware indicator (new prefill tokens if routed to an instance) by a load-balancing-aware indicator (current batch size), any positive scaling factor applied to either indicator multiplies every instance’s score by the same constant, so it cancels out when scores are compared. No tuning needed.
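The cancellation is easy to check numerically. The snippet below is a minimal sketch, not the authors' code: the instance names, the indicator values, and the function names weighted_sum_choice and product_choice are all made up for illustration. It shows a weighted sum flipping its choice when the weights change, while the product picks the same instance under any positive rescaling.

```python
# Hypothetical candidates: name -> (new_prefill_tokens, batch_size).
# The numbers are illustrative only, not taken from the paper.
candidates = {
    "A": (200, 12),  # good cache hit, but busy
    "B": (900, 3),   # poor cache hit, but nearly idle
}

def weighted_sum_choice(cands, w_cache, w_load):
    """Pick the instance with the lowest weighted-sum score."""
    return min(cands, key=lambda k: w_cache * cands[k][0] + w_load * cands[k][1])

def product_choice(cands, s_cache=1.0, s_load=1.0):
    """Pick the instance with the lowest product score.

    s_cache and s_load are positive scaling factors (assumed, for the demo).
    They multiply every score by s_cache * s_load, so the ranking is unchanged.
    """
    return min(cands, key=lambda k: (s_cache * cands[k][0]) * (s_load * cands[k][1]))

# The weighted sum depends on the weights chosen.
print(weighted_sum_choice(candidates, 1, 1))
print(weighted_sum_choice(candidates, 1, 100))

# The product gives the same answer whatever positive scales are applied.
print(product_choice(candidates))
print(product_choice(candidates, 0.01, 50))
```

With these numbers the two weighted-sum calls disagree with each other, and the two product calls agree. The argument only holds for strictly positive scaling factors.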
How LMETRIC Works
Two indicators, one product, one scheduling score. The KVCache indicator penalizes instances that would force expensive recomputation; the batch-size indicator discourages overtaxed instances. The multiplication treats both objectives jointly, and the authors prove mathematically that the optimal ordering is invariant to the relative weights. The only failure mode is when both indicators are zero, which is vanishingly rare in practice and detectable ahead of time.
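As a rough illustration of the scoring rule described above, here is a minimal sketch of a router that picks the instance with the lowest product. It is an assumption-laden toy, not LMETRIC's implementation: the Instance class, its fields, and the score and pick_instance functions are hypothetical names, and the cached-prefix count is stored per instance only to keep the example short (a real scheduler would look it up per request).

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    cached_prefix_tokens: int  # hypothetical: prompt tokens already in this instance's KVCache
    batch_size: int            # requests currently in the running batch

def score(inst: Instance, prompt_tokens: int) -> int:
    """Product of the two indicators; lower is better."""
    # KVCache-aware indicator: tokens that would still need prefill here.
    new_prefill = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    # Load-aware indicator: current batch size.
    return new_prefill * inst.batch_size

def pick_instance(instances: list[Instance], prompt_tokens: int) -> Instance:
    """Route to the instance with the smallest score (first one wins ties)."""
    return min(instances, key=lambda inst: score(inst, prompt_tokens))

# Illustrative cluster state for a 1000-token prompt.
cluster = [
    Instance("a", cached_prefix_tokens=800, batch_size=10),  # 200 * 10 = 2000
    Instance("b", cached_prefix_tokens=0, batch_size=4),     # 1000 * 4 = 4000
    Instance("c", cached_prefix_tokens=500, batch_size=3),   # 500 * 3 = 1500
]
print(pick_instance(cluster, 1000).name)
```

Instance "c" wins here even though it has neither the best cache hit nor the smallest batch, which is the joint trade-off the product is meant to capture. The sketch does nothing special when an indicator is zero; that degenerate case is the one the authors analyze and mitigate before routing.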
Real-World Results and Caveats
On real chatbot and coding-agent workloads, LMETRIC cuts TTFT by 92% and TPOT by 24% relative to vLLM-v1, and cuts TTFT by 39% and TPOT by 51% relative to an unnamed in-production scheduler. Those aren’t simulated numbers: LMETRIC is already deployed in production, with a canary release confirming the gains. The authors also derive the precise mathematical conditions under which multiplication could fail, but confirm they are “extremely rare” and can be mitigated before routing. If your LLM cluster still uses weighted-sum scheduling, you’re leaving latency on the table.
Source: Simple is Better: Multiplication May Be All You Need for LLM Request Scheduling
Domain: arxiv.org