
SPEAR Closes 56-75% of the Perplexity Gap Between 4-Bit and FP16 LLMs

A new system introduces input-dependent error compensators that recover most of the quality lost to aggressive quantization while adding less than 1% memory.

spear, llm serving, quantization, error compensation, kernel fusion, large language models

Quantization to 4 bits cuts LLM serving costs but typically leaves a perplexity gap of several points versus FP16. SPEAR recovers 56-75% of that gap with less than 1% memory overhead, evidence that input-dependent error compensation can succeed where static methods fall short.

Why Static Error Correction Bottlenecks Low-Bit LLMs

Today’s 4-bit quantizers lose quality in part because quantization error varies widely across tokens while the correction applied to it does not. Easy tokens get over-corrected by static compensation; hard tokens remain under-corrected. SPEAR’s team traces the problem to this mismatch: existing post-quantization methods apply identical corrections to every input, ignoring the input-dependent nature of the error.
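To make the token-to-token variation concrete, here is a minimal NumPy sketch. It is not taken from the paper: the layer sizes, the random data, and the simple symmetric round-to-nearest quantizer are all assumptions chosen for illustration. It quantizes a weight matrix to 4 bits with one scale per output channel and measures the relative output error for each token separately.

```python
import numpy as np

def quantize_per_channel_w4(w):
    """Symmetric round-to-nearest 4-bit quantization with one scale per
    output channel (row). Returns the dequantized weights."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)  # integer levels in [-7, 7]
    return np.clip(np.round(w / scale), -7, 7) * scale

def per_token_relative_error(x, w, w_q):
    """Relative L2 error of the layer output, computed per token (row of x)."""
    y_ref = x @ w.T
    y_q = x @ w_q.T
    return np.linalg.norm(y_q - y_ref, axis=1) / np.linalg.norm(y_ref, axis=1)

# Hypothetical toy layer: 128 input features, 64 output channels, 16 tokens.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
x = rng.normal(size=(16, 128))

w_q = quantize_per_channel_w4(w)
err = per_token_relative_error(x, w, w_q)
print(f"min {err.min():.4f}  max {err.max():.4f}  mean {err.mean():.4f}")
```

Even on random data the error differs from token to token, so a single fixed correction cannot be right for all of them. Real activations, with their outlier features, would be expected to spread the errors further, though this toy setup does not show that.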

SPEAR breaks that pattern with lightweight Error Compensators (ECs) gated per-token, placed only at the most error-sensitive layers. The team uses an entropy-aware diagnostic guided by CKA (centered kernel alignment, a measure of how similar two sets of activations are) to pinpoint those layers, concentrating a small parameter budget where it actually moves the needle. The result: adaptive correction that matches the difficulty of each token.
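The two ideas in this section can be sketched in a few lines of NumPy. The article does not describe the internals of SPEAR's ECs or its diagnostic, so everything here is an assumption: the EC is modeled as a low-rank correction scaled by a per-token sigmoid gate, the diagnostic is plain linear CKA between full-precision and quantized layer outputs (the entropy-aware part is omitted), and the names `linear_cka` and `gated_ec_forward` are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear centered kernel alignment between two (tokens, features) matrices."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_ec_forward(x, w_q, a, b, gate_w, gate_b):
    """Assumed EC form: y = x W_q^T + g(x) * (x A) B, with one gate per token."""
    base = x @ w_q.T                       # low-bit path (dequantized weights here)
    gate = sigmoid(x @ gate_w + gate_b)    # shape (tokens, 1), values in (0, 1)
    return base + gate * ((x @ a) @ b), gate

rng = np.random.default_rng(0)
d_in, d_out, rank, tokens = 128, 64, 4, 32   # toy sizes, not from the paper
w = rng.normal(size=(d_out, d_in))
w_q = np.round(w * 2.0) / 2.0                # crude stand-in for a low-bit matrix
x = rng.normal(size=(tokens, d_in))
y_ref = x @ w.T

# Diagnostic: a lower CKA would flag a layer as more damaged by quantization.
cka = linear_cka(y_ref, x @ w_q.T)

# Fit the rank-r correction from the SVD of the weight error (an assumption;
# a trained compensator would learn a, b, gate_w, and gate_b from data).
u, s, vt = np.linalg.svd((w - w_q).T, full_matrices=False)
a, b = u[:, :rank] * s[:rank], vt[:rank]

gate_w = np.zeros((d_in, 1))                 # untrained gate: bias only
y_off, _ = gated_ec_forward(x, w_q, a, b, gate_w, gate_b=-30.0)  # gate near 0
y_on, _ = gated_ec_forward(x, w_q, a, b, gate_w, gate_b=30.0)    # gate near 1
err_off = np.linalg.norm(y_off - y_ref)
err_on = np.linalg.norm(y_on - y_ref)
print(f"CKA {cka:.4f}  error gate-off {err_off:.3f}  gate-on {err_on:.3f}")
```

With the gate closed the layer behaves like the plain quantized layer, and with it open the low-rank term removes part of the error. A learned `gate_w` would let the gate land somewhere in between, differently for each token, which is the adaptive behavior the article describes.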

Three Systems Tricks That Make Adaptive Correction Fast

Adaptive gating introduces system headaches: extra computation, tensor-parallel synchronization from input-dependent control flow, and latency jitter. SPEAR’s developers address these with three techniques.

First, an adaptive kernel-fusion dispatch combines an epilogue-integrated peer-reduction kernel with P2P dual-write, fusing the post-EC computation into the low-bit GEMMs and avoiding separate kernel launches. Second, the P2P dual-write pattern itself reduces synchronization overhead by letting EC outputs bypass the usual all-reduce barrier. Third, an SLO-constrained, EC-aware scheduler absorbs the remaining latency variance, keeping serving performance predictable even under dynamic gating.

Measured Impact: 56-75% Gap Recovery at Negligible Cost

Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 (4-bit weights) and FP16. The memory cost is less than 1% of model parameters, and latency stays comparable to a widely used 4-bit serving deployment.
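A gap-recovery percentage is most naturally read as the share of the W4-to-FP16 perplexity difference that the compensated model wins back. The helper below encodes that reading; the formula is an assumption about how the figure is computed, and the perplexity values are invented purely to show the arithmetic. They are not results from the paper.

```python
def gap_recovery(ppl_fp16, ppl_w4, ppl_compensated):
    """Fraction of the W4-vs-FP16 perplexity gap closed by compensation.

    0.0 means no better than plain W4; 1.0 means FP16 quality.
    """
    gap = ppl_w4 - ppl_fp16
    if gap <= 0:
        raise ValueError("W4 perplexity must exceed FP16 perplexity")
    return (ppl_w4 - ppl_compensated) / gap

# Made-up numbers for illustration only.
recovered = gap_recovery(ppl_fp16=10.0, ppl_w4=12.0, ppl_compensated=10.8)
print(f"recovered {recovered:.0%} of the gap")
```

Under this reading, a 56-75% recovery still leaves a quarter to nearly half of the original gap in place.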

These numbers matter most for smaller models, where low-bit serving delivers the biggest cost savings but also suffers the largest quality hit. SPEAR improves that tradeoff: you keep the cost reduction and get back more than half of the lost quality.

By recovering most of the quality that low-bit quantization costs smaller models, SPEAR shifts the cost-quality tradeoff for production LLM serving. Expect to see this kernel fusion approach adopted in inference engines within a year.


Source: SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving
Domain: arxiv.org
