97-99% Accuracy Retained While Slashing TPOT: Cascaded LLM Serving That Routes Smart

A two-stage cascaded framework keeps 97-99% of the strongest model's accuracy while sending only hard queries to expensive models. A single interpretable hyperparameter controls the cost-accuracy trade-off.

A cascaded LLM serving framework retains 97–99% of the strongest model's accuracy while cutting Time Per Output Token (TPOT). That’s the headline number from arXiv:2606.27457, and it’s the kind of practical lever operators have been missing.

The Core Problem: One Model Doesn’t Fit All

Defaulting to a single LLM for every query is lazy and expensive. If that model is costly, easy queries get over-served; if it is cheap, hard queries get under-served. The paper’s two-stage approach targets both failure modes without manual prompt engineering or per-query cost guessing.

Stage 1 clusters incoming queries and assigns each cluster to the most cost-effective model from the available pool. The routing cost budget is set by a single interpretable hyperparameter, tuned offline. No black-box knobs. Stage 2 adds a quality estimation (QE) cascade: if Stage 1’s output is judged low-quality, the query gets escalated to a stronger model. Only hard or low-confidence cases ever touch the expensive models.
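The two stages can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Model` class, `serve`, `toy_quality_score`, the centroids, and the `qe_threshold` value are all hypothetical, and the "models" are stand-in functions rather than real inference calls.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass
class Model:
    name: str
    generate: Callable[[str], str]  # stand-in for a real inference call


def nearest_cluster(embedding: Sequence[float],
                    centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to the embedding (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(embedding, centroids[i]))


def serve(query: str,
          embedding: Sequence[float],
          centroids: Sequence[Sequence[float]],
          cluster_to_model: Dict[int, Model],
          strong_model: Model,
          quality_score: Callable[[str, str], float],
          qe_threshold: float) -> Tuple[str, str]:
    """Return (answer, name of the model that produced it)."""
    # Stage 1: route by cluster to the model chosen offline for that cluster.
    model = cluster_to_model[nearest_cluster(embedding, centroids)]
    answer = model.generate(query)
    # Stage 2: escalate only if the quality estimator flags the answer.
    if model is not strong_model and quality_score(query, answer) < qe_threshold:
        return strong_model.generate(query), strong_model.name
    return answer, model.name


# Toy setup (all values are assumptions for illustration).
small = Model("small", lambda q: "short answer")
large = Model("large", lambda q: "detailed answer")
centroids = [(0.0, 0.0), (1.0, 1.0)]
cluster_to_model = {0: small, 1: large}


def toy_quality_score(query: str, answer: str) -> float:
    # Pretend the estimator is confident only on queries marked "easy".
    return 0.9 if "easy" in query else 0.2


def route(query: str, embedding: Sequence[float]) -> Tuple[str, str]:
    return serve(query, embedding, centroids, cluster_to_model,
                 large, toy_quality_score, qe_threshold=0.5)


easy = route("easy question", (0.1, 0.1))      # stays on the small model
tricky = route("tricky question", (0.2, 0.1))  # routed small, then escalated
hard = route("hard question", (0.9, 0.8))      # routed straight to large
```

In this toy run, only the second query pays for two model calls; the first never touches the large model, and the third skips the small one entirely.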

Why This Works in Practice

On test datasets, the system delivers near-top-tier accuracy with materially lower TPOT. It requires only task-correctness labels—no human preference data, no costly annotation. When the model pool changes, the system adapts without reconfiguration. That’s a massive operational win for teams running production LLM pipelines where model versions rotate weekly.
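To make the labeling requirement concrete, here is a hedged sketch of the only supervision signal the routing stage would need under this reading: a correct/incorrect flag per (query, model) pair, aggregated by cluster. The record layout and the function name `per_cluster_accuracy` are assumptions for illustration, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_cluster_accuracy(
    records: Iterable[Tuple[int, str, bool]]
) -> Dict[Tuple[int, str], float]:
    """Aggregate (cluster_id, model_name, is_correct) records into accuracies."""
    hits: Dict[Tuple[int, str], int] = defaultdict(int)
    totals: Dict[Tuple[int, str], int] = defaultdict(int)
    for cluster_id, model_name, is_correct in records:
        key = (cluster_id, model_name)
        totals[key] += 1
        hits[key] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy labels: cluster 0 is easy for both models, cluster 1 only for "large".
records = [
    (0, "small", True), (0, "small", True),
    (0, "large", True), (0, "large", True),
    (1, "small", False), (1, "small", True),
    (1, "large", True), (1, "large", True),
]
accuracy = per_cluster_accuracy(records)
```

A table like `accuracy` is all a per-cluster model assignment needs as input, which is why no preference data enters the picture.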

The approach effectively ties spend to query difficulty: easy queries stay on cheap models, and only hard ones incur the cost of a strong model. Instead of guessing which model to use per query, you let unsupervised clustering and a cheap quality assessor handle the routing. The interpretable hyperparameter gives operators a clear dial: turn up the budget for higher accuracy, turn it down to save spend.
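One way to picture the dial is as a cost penalty in a per-cluster model choice. The sketch below is an assumed parameterization, not the paper's objective: each cluster picks the model maximizing `accuracy - cost_weight * cost`, so a lower `cost_weight` corresponds to a larger budget. The accuracies, costs, and traffic shares are made-up toy numbers.

```python
from typing import Dict, List, Tuple

# Per cluster: model name -> (accuracy on that cluster, relative cost per query).
ClusterStats = Dict[str, Tuple[float, float]]


def assign_models(cluster_stats: List[ClusterStats],
                  cost_weight: float) -> List[str]:
    """Pick, per cluster, the model with the best accuracy-minus-cost score."""
    return [
        max(stats, key=lambda m: stats[m][0] - cost_weight * stats[m][1])
        for stats in cluster_stats
    ]


def summarize(cluster_stats: List[ClusterStats],
              traffic_share: List[float],
              assignment: List[str]) -> Tuple[float, float]:
    """Return traffic-weighted (expected accuracy, expected cost per query)."""
    acc = sum(w * s[m][0]
              for w, s, m in zip(traffic_share, cluster_stats, assignment))
    cost = sum(w * s[m][1]
               for w, s, m in zip(traffic_share, cluster_stats, assignment))
    return acc, cost


# Toy numbers: an easy cluster (70% of traffic) and a hard one (30%).
cluster_stats = [
    {"small": (0.95, 1.0), "large": (0.96, 10.0)},
    {"small": (0.60, 1.0), "large": (0.90, 10.0)},
]
traffic_share = [0.7, 0.3]

sweep = {}
for cost_weight in (0.0, 0.01, 0.05):
    assignment = assign_models(cluster_stats, cost_weight)
    sweep[cost_weight] = (assignment,
                          summarize(cluster_stats, traffic_share, assignment))
    print(cost_weight, assignment, sweep[cost_weight][1])
```

With these toy numbers, the middle setting sends only the hard cluster to the large model, which is the behavior the dial is meant to expose: most of the accuracy of always using the large model at a fraction of its cost.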

What This Enables Next

Expect similar cascaded architectures to become more common in LLM serving stacks. The paper’s results suggest you don’t need a single “best” model; you need a routing layer that knows when to escalate. Production teams can start with their existing query logs: cluster them by embedding similarity, then run the same kind of offline tuning. The 97–99% retention figure may improve further as QE models get cheaper and better.


Source: Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving
Domain: arxiv.org
