
SPEAR Closes 56-75% of the Perplexity Gap Between 4-bit and FP16 LLMs

A new system introduces input-dependent error compensators that recover most of the quality loss from aggressive quantization while adding less than 1% memory overhead.

spear, llm serving, quantization, error compensation, kernel fusion, large language models

Quantizing to 4 bits cuts LLM serving costs but typically leaves a perplexity gap of several points relative to FP16. SPEAR recovers 56-75% of that gap with sub-1% memory overhead, evidence that input-dependent error compensation can work where static methods fall short.

Why Static Error Correction Bottlenecks Low-Bit LLMs

Today’s 4-bit quantizers lose quality because quantization error varies wildly across tokens. Easy tokens get over-corrected by static compensation; hard tokens remain under-corrected. SPEAR’s team identified this root cause by showing that existing post-quantization methods apply identical corrections to every input, ignoring the input-dependent nature of the error.
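To make the over- and under-correction argument concrete, here is a small sketch rather than anything taken from the SPEAR paper. It quantizes one weight row to 4 bits, measures the per-token error, and applies a single constant offset as a deliberately simplified stand-in for a static correction. The token distribution (many small-activation "easy" tokens, a few large-activation "hard" ones), the dimension, and every function name are assumptions made for illustration.

```python
import random

def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization with a single scale for the row."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(v) for v in row) / qmax
    if scale == 0:
        return list(row)
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
dim = 64
w = [random.gauss(0, 1) for _ in range(dim)]        # one output channel (assumed)
w_q = quantize_row(w)

# Hypothetical tokens: 200 "easy" ones with small activations, 20 "hard" ones with large.
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
tokens += [[random.gauss(0, 8) for _ in range(dim)] for _ in range(20)]

# Per-token quantization error on this channel: (w - w_q) . x
errors = [dot(w, x) - dot(w_q, x) for x in tokens]

# Simplified static correction: subtract the same mean error from every token.
static_fix = sum(errors) / len(errors)
residual = [e - static_fix for e in errors]

easy = [abs(r) for r in residual[:200]]
hard = [abs(r) for r in residual[200:]]
print("mean |residual| on easy tokens:", sum(easy) / len(easy))
print("mean |residual| on hard tokens:", sum(hard) / len(hard))
```

Because the error is (w - w_q) . x, it scales with the activation x, so any input-independent offset leaves the hard tokens with a much larger residual than the easy ones.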

SPEAR breaks that pattern with lightweight Error Compensators (ECs) gated per-token, placed only at the most error-sensitive layers. The team uses a CKA-guided entropy-aware diagnostic to pinpoint those layers, concentrating a small parameter budget where it actually moves the needle. The result: adaptive correction that matches the difficulty of each token.
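One way to picture a per-token gated compensator is the sketch below. It assumes a low-rank correction B(Ax) added to the quantized layer output and scaled by a sigmoid gate, with the correction skipped entirely when the gate falls below a threshold. The article does not specify SPEAR's actual EC architecture, so the class name, the low-rank form, the gate, and the threshold are all hypothetical.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class GatedCompensator:
    """Hypothetical per-token compensator: y = W_q x + g(x) * B (A x)."""

    def __init__(self, A, B, gate_w, gate_b, threshold=0.5):
        self.A = A                  # r x d_in down-projection (small rank r)
        self.B = B                  # d_out x r up-projection
        self.gate_w = gate_w        # d_in gate weights
        self.gate_b = gate_b        # scalar gate bias
        self.threshold = threshold  # below this, the token is treated as easy

    def gate(self, x):
        return sigmoid(sum(w * xi for w, xi in zip(self.gate_w, x)) + self.gate_b)

    def __call__(self, W_q, x):
        y = matvec(W_q, x)          # output of the quantized layer
        g = self.gate(x)
        if g < self.threshold:
            return y, False         # easy token: no correction applied
        corr = matvec(self.B, matvec(self.A, x))
        return [yi + g * ci for yi, ci in zip(y, corr)], True

# Toy usage with made-up 2x2 weights and a rank-1 compensator.
W_q = [[1.0, 0.0], [0.0, 1.0]]
ec = GatedCompensator(A=[[1.0, 1.0]], B=[[0.5], [0.5]], gate_w=[1.0, 1.0], gate_b=-3.0)

y_easy, used_easy = ec(W_q, [0.1, 0.2])   # small activations keep the gate closed
y_hard, used_hard = ec(W_q, [3.0, 2.0])   # large activations open the gate
print(used_easy, y_easy)
print(used_hard, y_hard)
```

The data-dependent branch is what makes the correction cheap for easy tokens. It is also the kind of input-dependent control flow that complicates batching and tensor-parallel execution.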

Three Systems Tricks That Make Adaptive Correction Fast

Adaptive gating introduces system headaches—extra computation, tensor-parallel synchronization from input-dependent control flow, and latency jitter. SPEAR’s developers solved these with three specific innovations.

First, an adaptive kernel-fusion dispatch combines an epilogue-integrated peer-reduction kernel with P2P dual-write, fusing the post-EC computation into the low-bit GEMMs and avoiding separate kernel launches. Second, the P2P dual-write pattern itself minimizes synchronization overhead by letting EC outputs bypass the usual all-reduce barrier. Third, an SLO-constrained EC-aware scheduler absorbs the remaining latency variance, delivering predictable serving performance even under dynamic gating.
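The article names the scheduler but does not describe its policy, so the following is only a hedged illustration of what "SLO-constrained" and "EC-aware" could mean in practice. It assumes a linear latency model in which tokens that trigger a compensator cost extra, and it greedily admits requests while the predicted step latency stays within the SLO. The `Request` fields, the cost constants, and the greedy policy are all invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    tokens: int        # tokens to process in this step
    ec_rate: float     # estimated fraction of tokens whose gate fires (assumed known)

def predicted_latency_ms(batch, base_ms=2.0, per_token_ms=0.05, per_ec_token_ms=0.02):
    """Hypothetical linear cost model: EC-active tokens add extra time."""
    tokens = sum(r.tokens for r in batch)
    ec_tokens = sum(r.tokens * r.ec_rate for r in batch)
    return base_ms + per_token_ms * tokens + per_ec_token_ms * ec_tokens

def build_batch(queue, slo_ms):
    """Greedily admit requests whose addition keeps the prediction within the SLO."""
    batch, deferred = [], []
    for req in queue:
        if predicted_latency_ms(batch + [req]) <= slo_ms:
            batch.append(req)
        else:
            deferred.append(req)   # retried in a later step
    return batch, deferred

queue = [
    Request("a", tokens=100, ec_rate=0.1),
    Request("b", tokens=100, ec_rate=0.9),   # many hard tokens, so more EC work
    Request("c", tokens=20, ec_rate=0.5),
]
batch, deferred = build_batch(queue, slo_ms=10.0)
print([r.req_id for r in batch], [r.req_id for r in deferred])
```

Folding the expected EC activation rate into the latency prediction is the point here. It lets a scheduler account for gating-induced variance before a batch is launched rather than after.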

Measured Impact: 56-75% Gap Recovery at Negligible Cost

Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16. The memory cost is less than 1% of model parameters, and latency stays comparable to a widely used 4-bit serving deployment, so the recovered quality does not come with a latency regression.
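For readers who want to see how a "gap recovery" percentage is computed, the helper below divides the perplexity improvement over plain W4 by the full W4-to-FP16 gap. The perplexity values in the usage lines are made up purely to show the arithmetic and are not results from the paper.

```python
def gap_recovered(ppl_fp16, ppl_w4, ppl_compensated):
    """Fraction of the W4-vs-FP16 perplexity gap closed by a compensated W4 model."""
    gap = ppl_w4 - ppl_fp16
    if gap <= 0:
        raise ValueError("expected the W4 perplexity to exceed the FP16 perplexity")
    return (ppl_w4 - ppl_compensated) / gap

# Illustrative numbers only: FP16 = 10.0, plain W4 = 12.0, compensated W4 = 10.8.
fraction = gap_recovered(ppl_fp16=10.0, ppl_w4=12.0, ppl_compensated=10.8)
print(f"recovered about {fraction:.0%} of the gap")
```

On this definition, 0% means no improvement over plain W4 and 100% means the compensated model matches FP16, so the reported 56-75% range sits between the two.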

These numbers matter most for smaller models, where low-bit serving delivers the biggest cost savings but also suffers the largest quality hit. SPEAR eases that tradeoff: you keep the cost reduction and get back most of the lost quality.

By recovering most of the quality that smaller models lose to low-bit quantization, SPEAR shifts the cost-quality tradeoff for production LLM serving. Its kernel-fusion approach may well be adopted in inference engines within a year.


Source: SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving
Domain: arxiv.org
