OmniPilot Predicts LLM Inference Throughput on Heterogeneous Clusters with 6.2% MAPE

A new launch advisor predicts throughput with 6.2% MAPE and 95% top-1 accuracy across A100, H100, H200 hardware and four precisions, while abstaining on unfamiliar configurations.

Tags: OmniPilot, LLM inference, GPU clusters, uncertainty quantification, AI infrastructure, systems engineering

The routine choice of a GPU, a tensor-parallel degree, and a precision costs more than you might think. OmniPilot, a new launch advisor, predicts the throughput of those choices with 6.2% mean absolute percentage error (MAPE) across 460 benchmark runs on A100, H100, and H200 hardware.

What OmniPilot Actually Does

OmniPilot is a launch advisor for LLM inference on shared, heterogeneous GPU clusters. Static configuration recipes ignore fluctuating throughput, launch-success rates, and cluster demand. OmniPilot instead predicts serving costs for every feasible configuration and abstains when a request falls outside its measured support envelope. The system pairs a conformally calibrated quantile cost model, which spans eight serving targets, with an out-of-distribution (OOD) abstention layer. It ranks configurations using an economic utility metric calibrated to an operator's revealed preferences.
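The rank-or-abstain behavior described above can be sketched in a few lines. This is a minimal illustration, not OmniPilot's implementation: the `Config` type, the exact-match support check, the precision labels, and the utility numbers are all hypothetical, and a real OOD layer would presumably use something richer than set membership.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass(frozen=True)
class Config:
    gpu: str
    tp: int          # tensor-parallel degree
    precision: str   # placeholder labels, not taken from the paper


def in_envelope(cfg: Config, measured: Set[Config]) -> bool:
    """Crude stand-in for OOD detection: was this exact combination benchmarked?"""
    return cfg in measured


def advise(candidates: Iterable[Config],
           measured: Set[Config],
           utility: Callable[[Config], float]) -> Optional[Config]:
    """Return the highest-utility supported config, or None to abstain."""
    supported = [c for c in candidates if in_envelope(c, measured)]
    if not supported:
        return None  # abstain rather than recommend an unmeasured config
    return max(supported, key=utility)


# Made-up benchmark coverage and utilities, for illustration only.
measured = {Config("A100", 2, "fp16"), Config("H100", 1, "fp8")}
scores = {Config("A100", 2, "fp16"): 0.61, Config("H100", 1, "fp8"): 0.78}

best = advise(measured, measured, lambda c: scores[c])
unknown = advise([Config("H200", 8, "int4")], measured, lambda c: 0.0)
print(best, unknown)
```

Returning `None` keeps abstention explicit in the caller's control flow, so an operator has to decide what to do with an unsupported request instead of silently receiving a guess.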

The Numbers That Matter

Across 460 benchmarks covering A100, H100, and H200 GPUs and four precisions, OmniPilot predicts aggregate throughput with 6.2% MAPE and a log-space R² of 0.92. Top-1 accuracy hits 95%, and mean utility regret sits at just 0.003. Those are not academic rounding errors. They represent real GPU-hour savings when you're juggling model families that interact differently with quantization, key-value cache pressure, and tensor-parallel failure rates that vary by more than twofold.
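For readers who want to see how such figures are computed, here is a small sketch of the three metrics using their standard textbook definitions. The article does not spell out the paper's exact formulas, so treat these as assumptions; the throughput and utility values are invented.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def log_r2(actual, predicted):
    """Coefficient of determination computed on log-transformed values."""
    la = [math.log(a) for a in actual]
    lp = [math.log(p) for p in predicted]
    mean = sum(la) / len(la)
    ss_res = sum((a - p) ** 2 for a, p in zip(la, lp))
    ss_tot = sum((a - mean) ** 2 for a in la)
    return 1.0 - ss_res / ss_tot


def utility_regret(utilities, chosen):
    """Gap between the best available utility and that of the chosen config."""
    return max(utilities.values()) - utilities[chosen]


# Invented throughput values (tokens/s), for illustration only.
actual = [1000.0, 2000.0, 4000.0]
predicted = [1100.0, 1900.0, 4000.0]

print(f"MAPE: {mape(actual, predicted):.1f}%")
print(f"log-space R^2: {log_r2(actual, predicted):.3f}")
print(f"regret: {utility_regret({'a': 0.9, 'b': 0.7}, 'b'):.2f}")
```

A regret of zero means the advisor picked the configuration with the highest utility, so a mean regret of 0.003 says its picks were, on average, almost indistinguishable from the best available option.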

Why Abstention Matters More Than Accuracy

Static configs miss critical interactions: quantization effects depend heavily on the model family, KV-cache pressure creates size-by-precision trade-offs, and failure rates vary by more than 2× across different tensor-parallel degrees. OmniPilot's OOD abstention layer catches what it doesn't know. When tested on an unsupported holdout, prediction error climbed to 24–46%, and conformal intervals covered 0 of 5 points. The layer flagged all five as low-confidence. Over time, those OOD scenarios get folded back into training, continuously expanding the advisor's support envelope.
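To make the coverage claim concrete, the sketch below applies conformalized quantile regression (CQR), a standard way to calibrate quantile intervals, and then measures empirical coverage. The article does not say which conformal procedure OmniPilot uses, so CQR is an assumption here, and all numbers are invented.

```python
import math


def cqr_margin(lo, hi, y, alpha=0.1):
    """Margin that widens quantile intervals so they reach 1 - alpha coverage
    on exchangeable data, estimated from a held-out calibration set."""
    # Conformity score: how far each target falls outside its interval
    # (negative when the target is inside).
    scores = sorted(max(l - t, t - h) for l, h, t in zip(lo, hi, y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def coverage(lo, hi, y, margin=0.0):
    """Fraction of targets that land inside the margin-widened intervals."""
    hits = sum(l - margin <= t <= h + margin for l, h, t in zip(lo, hi, y))
    return hits / len(y)


# Invented calibration data: predicted [lo, hi] intervals and observed values.
cal_lo = [90.0, 90.0, 90.0, 90.0]
cal_hi = [110.0, 110.0, 110.0, 110.0]
cal_y = [100.0, 115.0, 120.0, 85.0]

margin = cqr_margin(cal_lo, cal_hi, cal_y, alpha=0.5)
print(margin, coverage(cal_lo, cal_hi, cal_y), coverage(cal_lo, cal_hi, cal_y, margin))
```

The guarantee behind this kind of calibration only holds when test points resemble the calibration data. On an out-of-distribution holdout, coverage can collapse, as in the reported 0 of 5, which is why a separate abstention layer is needed on top of the intervals.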

Cluster operators finally have a way to say "I don't know" instead of serving a bad config that wastes node-hours. OmniPilot's design turns that uncertainty into a feedback loop that keeps getting tighter.


Source: OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters
Domain: arxiv.org
