Low-bit KV Cache Quantization Silently Destroys LLM Safety Alignment

Mistral-7B loses 15.2% of its safety refusals under KV cache quantization with virtually no perplexity increase; a new diagnostic recovers up to 97% of lost alignment in 35 GPU-minutes.

kv cache quantization, per channel reduction, mistral 7b, vllm, large language models, safety alignment

Mistral-7B loses 15.2% of its safety refusals under KV cache quantization while perplexity rises by a factor of only 1.03, so the standard metric gives no warning.

That's the headline finding from a new study spanning eleven instruction-tuned models (3.8B to 72B parameters) across five benchmarks totaling 1,894 prompts. The authors show that low-bit quantization can silently erase alignment: no universal safe bit-width exists, and each model hits a sharp phase transition that standard perplexity and accuracy metrics miss entirely.

Safety Alignment Collapses Without Warning

Existing evaluations of KV cache quantization measure only perplexity and accuracy. This work is the first to check whether the model still refuses harmful requests after compression. Models fail at different points: some lose refusal capability at 4-bit, others at 3-bit, and in each case the failure is abrupt. Mistral-7B, for example, drops from near-perfect refusal to an 84.8% refusal rate with no perplexity spike. The authors call this "alignment collapse": the model becomes less safe, and the standard numbers give no sign of it.

The Geometric Root Cause: Low-Dimensional Vulnerabilities

The team traced the collapse to geometry. Safety features live in a low-dimensional activation subspace that is $10^2$–$10^3$ times more vulnerable to quantization noise than the full representation space that perplexity averages over. When you quantize the KV cache, you're compressing the whole space, but the safety-relevant directions get crushed first. That asymmetric sensitivity explains why perplexity stays flat while refusal rates plummet.
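To make the asymmetry concrete, here is a toy NumPy sketch, not the paper's measurement. It assumes channel energy decays geometrically and that a hypothetical "safety subspace" sits in the four lowest-energy channels; the bit-width, shapes, and the choice of subspace are all illustrative assumptions. It compares the relative quantization error inside that subspace with the relative error over the full tensor, which is the kind of average perplexity responds to.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
n_tokens, n_channels = 4096, 64

# Assumption: per-channel standard deviation decays geometrically from 10 to 0.1.
stds = np.geomspace(10.0, 0.1, n_channels)
x = rng.normal(size=(n_tokens, n_channels)) * stds

# Hypothetical safety subspace: the four lowest-energy channels.
safety = slice(n_channels - 4, n_channels)

err = quantize_uniform(x, bits=8) - x

# Relative error = noise energy / signal energy, over the full tensor and over the subspace.
rel_full = (err ** 2).sum() / (x ** 2).sum()
rel_safety = (err[:, safety] ** 2).sum() / (x[:, safety] ** 2).sum()
vulnerability_ratio = rel_safety / rel_full

print(f"full-space relative error:  {rel_full:.2e}")
print(f"safety-subspace rel. error: {rel_safety:.2e}")
print(f"ratio: {vulnerability_ratio:.0f}x")
```

In this toy setup the full-space error stays tiny because high-energy channels dominate the average, while the low-energy channels lose most of their signal to rounding. The exact ratio depends entirely on the assumed spectrum and bit-width.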

Per-Channel Reduction: Three Failure Modes and a 35-Minute Fix

Inspired by that geometric insight, the authors propose Per-Channel Reduction (PCR). PCR classifies each model into one of three mechanistic failure modes:

  • Outlier-crushes-safety: safety resides in non-outlier channels and suffers collateral damage from outlier-driven scale factors.
  • Outlier-as-safety: safety overlaps outlier channels, so finer granularity can't rescue it.
  • Multi-layer dilution: safety is distributed across many layers, so per-layer fixes fail.
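The first failure mode can be illustrated with a small NumPy sketch. This is not PCR itself, only a toy comparison of quantization granularities under assumed conditions: channel 0 is a hypothetical outlier channel, channel 7 is a hypothetical non-outlier channel assumed to carry the safety signal, and the 4-bit symmetric quantizer is a generic one rather than KIVI's.

```python
import numpy as np

def quantize(x, bits, axis):
    """Symmetric round-to-nearest quantization with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def channel_rel_error(x, xq, ch):
    """Quantization noise energy divided by signal energy for a single channel."""
    return float(((xq[:, ch] - x[:, ch]) ** 2).sum() / (x[:, ch] ** 2).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 32))  # rows are tokens, columns are channels
x[:, 0] *= 40.0                  # hypothetical outlier channel
safety_ch = 7                    # hypothetical non-outlier "safety" channel

# Coarse: one scale per token, shared by all channels, so the outlier sets it.
per_token = quantize(x, bits=4, axis=1)
# Fine: one scale per channel, so the outlier no longer affects channel 7.
per_channel = quantize(x, bits=4, axis=0)

err_coarse = channel_rel_error(x, per_token, safety_ch)
err_fine = channel_rel_error(x, per_channel, safety_ch)
print(f"safety-channel relative error, per-token scale:   {err_coarse:.3f}")
print(f"safety-channel relative error, per-channel scale: {err_fine:.3f}")
```

Here finer granularity helps because the safety signal sits outside the outlier channel. Under the second mode, where safety overlaps the outlier channels, the article's point is that the same change would not rescue it, which is why the mitigation direction has to be diagnosed per model.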

PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using just 20 calibration prompts. The resulting training-free protocol requires approximately 35 GPU-minutes and recovers up to 97.2% of lost alignment when tested against KIVI, a production-level quantizer. Crucially, PCR succeeds where attention-based allocation methods fail.

Production-Ready: vLLM on NVIDIA GPUs

The authors validated the vulnerability in production vLLM serving with FP8 KV cache on NVIDIA GPUs. This is not a toy experiment: anyone running quantized inference at scale on current hardware is exposed to silent alignment erosion. PCR provides a lightweight diagnostic and mitigation that adds minimal memory overhead, letting you keep the memory savings of quantization without trading away safety.

Turning a 35-minute diagnostic into a build-time check would make low-bit inference safe by default. This paper gives the field the tool to do it.
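As a rough idea of what such a build-time check could look like, here is a minimal sketch of a refusal-rate regression gate. It is not PCR and not the authors' protocol: the function names, the `max_drop` threshold, and the sample counts are hypothetical, and it assumes you already have per-prompt refusal labels for the baseline and the quantized model.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    refusal_drop: float  # absolute drop in refusal rate, as a fraction
    passed: bool

def refusal_rate(refused):
    """Fraction of harmful prompts the model refused (list of booleans)."""
    return sum(refused) / len(refused)

def alignment_gate(baseline_refused, quantized_refused, max_drop=0.02):
    """Fail the build if quantization lowers the refusal rate by more than `max_drop`.

    `max_drop` is a hypothetical tolerance; pick one that fits your own risk budget.
    """
    drop = refusal_rate(baseline_refused) - refusal_rate(quantized_refused)
    return GateResult(refusal_drop=drop, passed=drop <= max_drop)

# Illustrative labels chosen to mirror the 84.8% figure quoted for Mistral-7B.
baseline = [True] * 1000
quantized = [True] * 848 + [False] * 152

result = alignment_gate(baseline, quantized)
print(result)
```

The point of gating on refusal rate rather than perplexity is the article's central claim: a 1.03x perplexity change would pass any perplexity threshold while this check fails.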


Source: Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
Domain: arxiv.org
