FAIR-Calib Keeps Quantization from Flipping Fragile Token Decisions in Diffusion LLMs

A simple two-stage calibration method reduces frontier decision flips in LLaDA and Dream at 4-bit weights and activations, locking in correct early token choices without costly rollouts.

Tags: FAIR-Calib, LLaDA, Dream, diffusion LLMs, post-training quantization, large language models

Diffusion Large Language Models (dLLMs) refine tokens iteratively, but once a token is written it is locked in, and quantization error at the exact moment of commitment can flip that decision irreversibly. The authors of FAIR-Calib found that error from standard post-training quantization (PTQ) easily toggles borderline tokens at the write frontier, and those flips stick. They built a calibration method that needs no expensive end-to-end rollouts yet consistently beats prior work on both LLaDA and Dream at W4A4 (4-bit weights and activations).
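To make the W4A4 shorthand concrete, here is a minimal sketch of symmetric per-tensor "fake" quantization, a common way to simulate low-bit arithmetic in floating point. The function name `fake_quantize` and the per-tensor scaling are assumptions chosen for illustration; the paper and its baselines may use different quantizers (per-channel scales, rotations, and so on).

```python
import numpy as np

def fake_quantize(x, bits=4):
    """Quantize to a signed `bits`-bit grid, then map back to floats.

    Illustrative symmetric per-tensor scheme; not taken from the paper.
    """
    qmax = 2 ** (bits - 1) - 1               # 7 for signed 4-bit
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()                       # nothing to scale
    scale = max_abs / qmax                    # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))                   # stand-in weight matrix
x = rng.normal(size=(8,))                     # stand-in activation vector

y_fp = w @ x                                  # full-precision output
y_w4a4 = fake_quantize(w) @ fake_quantize(x)  # weights and activations both 4-bit
quant_error = y_w4a4 - y_fp                   # the perturbation PTQ introduces
```

The vector `quant_error` is the kind of perturbation that calibration tries to keep small where it matters most.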

Why dLLM Quantization Is Harder Than Autoregressive

Autoregressive models generate left-to-right; a mistake in token $t$ can be corrected when predicting $t+1$. dLLMs commit a token, then continue refining only the future (or all remaining) positions. The authors call this “stability lag”: early decisions remain fragile even after they’re written, because later refinement passes can’t un-write them. PTQ error doesn’t just corrupt a single forward pass; it corrupts a permanent choice that every later refinement step has to live with.

Standard calibration weights all hidden states equally, but that’s wrong for dLLMs. The frontier positions—tokens that are about to be written by the model’s discrete sampling step—are disproportionately sensitive. A tiny perturbation flips the decision, and the model can never take it back.
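The sensitivity of borderline positions can be illustrated with a toy experiment. The sketch below models quantization error as Gaussian noise added to logits and uses greedy argmax as the commit rule; both are simplifying assumptions, and `flip_rate` is a hypothetical helper, not something from the paper.

```python
import numpy as np

def flip_rate(logits, noise_std, trials=2000, seed=0):
    """Fraction of noisy trials whose argmax differs from the clean argmax."""
    rng = np.random.default_rng(seed)
    clean_choice = int(np.argmax(logits))
    # Gaussian noise is a stand-in for quantization error (assumption).
    noise = rng.normal(scale=noise_std, size=(trials, logits.size))
    noisy_choice = np.argmax(logits + noise, axis=1)
    return float(np.mean(noisy_choice != clean_choice))

confident = np.array([5.0, 1.0, 0.5, 0.0])     # large top-1 margin
borderline = np.array([2.05, 2.0, 0.5, 0.0])   # top two logits nearly tied

rate_confident = flip_rate(confident, noise_std=0.1)
rate_borderline = flip_rate(borderline, noise_std=0.1)
```

With the same noise level, the confident position essentially never changes its decision, while the near-tied one flips in a sizeable share of trials. In a dLLM, each such flip at a write position is permanent.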

FAIR-Calib’s Two-Stage Attack: Probe Then Protect

Stage I runs a full-precision teacher model and computes a position prior that mixes two signals: how often a position is a frontier hit (write location) and the masked-stage reliability of that position. Stage II goes off-policy, performing layer-wise calibration by minimizing a reweighted hidden-state MSE. The weights come from the prior, so the optimizer spends most of its budget on the fragile frontier states. The authors prove this weighted objective is a surrogate for output KL divergence, giving it a solid theoretical grounding.
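The two stages can be sketched roughly as follows. The exact mixing rule for the prior is not given here, so the convex blend with weight `alpha`, the normalization, and the names `position_prior` and `weighted_hidden_mse` are all assumptions; the sketch also assumes the masked-stage signal is already oriented so that larger values mean "deserves more weight".

```python
import numpy as np

def position_prior(frontier_freq, masked_signal, alpha=0.5):
    """Blend two per-position signals into weights that sum to 1.

    The convex mix is an illustrative assumption, not the paper's formula.
    """
    f = np.asarray(frontier_freq, dtype=float)
    m = np.asarray(masked_signal, dtype=float)
    if f.sum() <= 0 or m.sum() <= 0:
        raise ValueError("each signal needs positive total mass")
    return alpha * f / f.sum() + (1.0 - alpha) * m / m.sum()

def weighted_hidden_mse(h_teacher, h_quant, prior):
    """Prior-weighted MSE between hidden states of shape (positions, dim)."""
    per_position = np.mean((h_teacher - h_quant) ** 2, axis=-1)
    return float(np.sum(prior * per_position))

# Stage I stand-ins: statistics gathered from full-precision teacher probes.
frontier_freq = [0.1, 0.6, 0.3, 0.0]   # how often each position is written
masked_signal = [0.2, 0.5, 0.2, 0.1]   # masked-stage score per position
prior = position_prior(frontier_freq, masked_signal)

# Stage II stand-ins: the same unit error placed at two different positions.
h_teacher = np.zeros((4, 3))
h_err_frontier = h_teacher.copy(); h_err_frontier[1] = 1.0  # high-prior position
h_err_quiet = h_teacher.copy(); h_err_quiet[3] = 1.0        # low-prior position

loss_frontier = weighted_hidden_mse(h_teacher, h_err_frontier, prior)
loss_quiet = weighted_hidden_mse(h_teacher, h_err_quiet, prior)
```

An identical hidden-state error costs more at the frequently written position than at the quiet one, so a layer-wise optimizer minimizing this loss is pushed to fit frontier states first.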

No end-to-end diffusion rollouts are required during calibration—just the teacher probes and a small calibration set. That’s a practical win for anyone who wants to ship quantized dLLMs without burning GPU-days on iterative distillation.

Results at 4-Bit That Actually Hold

On LLaDA and Dream, FAIR-Calib at W4A4 consistently outperforms state-of-the-art PTQ baselines (the paper cites specific perplexity and accuracy numbers across multiple benchmarks). More importantly, the method significantly reduces frontier decision flips and suppresses post-commit mismatches. The improvement isn’t marginal—it’s the difference between a model that drifts into incoherence and one that stays on track after quantization.
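One plausible way to measure frontier decision flips is to compare the tokens a full-precision teacher and a quantized model would commit at the same write positions. The sketch below does that for a single step; the helper `frontier_flip_rate` and its inputs are hypothetical, and the paper's own metric definitions may differ.

```python
import numpy as np

def frontier_flip_rate(teacher_tokens, quant_tokens, frontier_mask):
    """Share of frontier positions where the two models commit different tokens."""
    teacher_tokens = np.asarray(teacher_tokens)
    quant_tokens = np.asarray(quant_tokens)
    frontier_mask = np.asarray(frontier_mask, dtype=bool)
    if not frontier_mask.any():
        return 0.0                     # no writes this step, so no flips
    mismatches = teacher_tokens[frontier_mask] != quant_tokens[frontier_mask]
    return float(np.mean(mismatches))

# Toy step: positions 0, 1 and 3 are written; position 2 stays masked.
teacher = [5, 7, 9, 2]
quantized = [5, 8, 9, 3]
mask = [True, True, False, True]
rate = frontier_flip_rate(teacher, quantized, mask)   # 2 of 3 writes differ
```

Averaging this rate over denoising steps and calibration prompts gives a single number that can be tracked before and after calibration.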

If you’re quantizing a diffusion LLM, the right place to spend your error budget is at the write frontier. FAIR-Calib gives you a principled way to do exactly that, and it works without reinventing your training pipeline. That’s the kind of quantization trick that actually makes production deployment feasible for this model class.


Source: FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
Domain: arxiv.org
