
Reverse Iteration and S=256 Fix FP8 Attention's P-Collapse

Forward KV iteration zeros out 3 to 10 times more softmax values in FP8 attention; a closed-form threshold predicts the collapse, and reverse order plus S=2^8 eliminates the underflow completely.

flashattention 3, flashattention 4, fp8, attention mechanisms, large language models, precision optimization

Forward KV iteration in FP8 attention silently zeros out a measurable fraction of softmax probabilities. A new analysis in arXiv:2606.06521 pins down exactly how many: under the Attention Sink phenomenon, non-sink P values underflow to zero at a rate of $\Phi(\Delta + \delta_k - 6.93 - \ln S)$, where $S$ is the static scaling factor applied to P before the cast to FP8. With $S=256$ and reverse iteration combined, that fraction drops to zero.
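To get a feel for the rate formula, here is a minimal Python sketch that simply evaluates it. It assumes $\Phi$ is the standard normal CDF, which the article does not state outright; the function names and the sample values $\Delta = 10$, $\delta_k = 1$ are illustrative, not taken from the paper.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF (assumed meaning of Phi in the article)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def underflow_fraction(delta: float, delta_k: float, scale: float) -> float:
    """Evaluate Phi(delta + delta_k - 6.93 - ln S) for forward iteration.

    delta, delta_k and scale correspond to the article's Delta, delta_k and S;
    the function itself is a hypothetical helper for illustration only.
    """
    return normal_cdf(delta + delta_k - 6.93 - math.log(scale))

# Illustrative inputs only: same sink strength, two different static scales.
for scale in (1.0, 256.0):
    frac = underflow_fraction(delta=10.0, delta_k=1.0, scale=scale)
    print(f"S={scale:>5.0f}: predicted underflow fraction = {frac:.4f}")
```

Because $\ln S$ enters with a negative sign, a larger scale moves the argument of $\Phi$ to the left and lowers the predicted fraction. The constant 6.93 is numerically close to $\ln 2^{10}$; reading it as the E4M3 round-to-zero boundary is this sketch's interpretation.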

Why Forward Iteration Breaks P and Reverse Fixes It

Softmax output is cast from FP32 to E4M3 before the $P \times V$ multiply. E4M3's narrow dynamic range (its smallest positive value is $2^{-9}$) can't represent small probabilities, but the damage depends on iteration order. Forward iteration lets sink tokens dominate the score maximum, pushing non-sink values below the FP8 minimum representable number. The paper shows that the shift $\delta_k \approx 1$ for $k_{\text{sink}}=4$ (the expected within-sink-block score maximum) pushes the underflow threshold just enough to cause collapse. Reverse iteration visits the sink block last, so non-sink blocks are exponentiated against their own running maximum instead of the sink's; combined with $S=256$, you get a zero-underflow guarantee.
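The effect of iteration order can be sketched with a toy NumPy simulation. Everything here is an assumption of the sketch rather than the paper's kernel: `quantize_e4m3` is a hand-written rounding model of E4M3, non-sink scores are drawn from a standard normal, the four sink tokens get a block of their own, and `count_flushed` only counts P values that round to zero instead of running the full $P \times V$ accumulation.

```python
import numpy as np

def quantize_e4m3(x):
    """Round non-negative values to the nearest E4M3 value, saturating at 448."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, 448.0)
    _, exp = np.frexp(x)                 # x = m * 2**exp with m in [0.5, 1)
    e = np.maximum(exp - 1, -6)          # below 2**-6 the grid is subnormal
    step = 2.0 ** (e - 3)                # spacing set by the 3 mantissa bits
    return np.round(x / step) * step     # ties go to even, as in IEEE rounding

def count_flushed(scores, block, scale, reverse):
    """Count P values that the scaled FP8 cast rounds to exactly zero."""
    starts = list(range(0, len(scores), block))
    if reverse:
        starts.reverse()                 # visit KV blocks from last to first
    running_max, flushed = -np.inf, 0
    for start in starts:
        s = scores[start:start + block]
        running_max = max(running_max, float(s.max()))
        p = np.exp(s - running_max)      # unnormalized P against the running max
        flushed += int(np.count_nonzero(quantize_e4m3(p * scale) == 0.0))
    return flushed

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
K_SINK, DELTA, SCALE = 4, 12.0, 256.0
scores = rng.standard_normal(512)
scores[:K_SINK] += DELTA                 # sink tokens sit at the start

forward = count_flushed(scores, K_SINK, SCALE, reverse=False)
backward = count_flushed(scores, K_SINK, SCALE, reverse=True)
print(f"flushed to zero: forward={forward}, reverse={backward}")
```

In forward order the running maximum is set by the sink block first, so every later block is exponentiated against it. In reverse order the sink block arrives last, and the correction for the larger maximum would be applied to the FP32 accumulator instead of to FP8 values. Keys that share a block with a sink would see the sink maximum in either order, which is why this sketch isolates the sinks in their own block.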

Why S=2^8 Beats Every Other Scale

The static scale $S$ is applied to P before casting. The authors prove $S=256 = 2^8$ is optimal on three independent criteria: (i) it's a bit-exact IEEE 754 power-of-two scale, (ii) it sits at the lower envelope of the sawtooth function $dp(S)$ over the E4M3 number line, giving the minimum worst-case quantization step $dp = 2^{-4}$, and (iii) it maximizes normal-range coverage among all bit-exact $2^k$ scales. A non-bit-exact scale like 448 covers slightly more of the range, but breaks the bit-exact property and complicates hardware. The closed-form threshold $\Delta_c = 6.93 + \ln S - \delta_k$ lets engineers predict kernel-level precision loss without running the kernel.
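A short Python sketch can illustrate two of these points: what "bit-exact" means for a power-of-two scale, and how the threshold $\Delta_c$ moves with $S$. The helper names are hypothetical, $\delta_k = 1$ is the article's value for $k_{\text{sink}}=4$, and treating 448 (the largest finite E4M3 value) as the ceiling for criterion (iii) is this sketch's reading rather than the paper's proof.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value

def is_power_of_two(scale: float) -> bool:
    """True when scale = 2**k, i.e. its frexp mantissa is exactly 0.5."""
    mantissa, _ = math.frexp(scale)
    return mantissa == 0.5

def preserves_mantissa(p: float, scale: float) -> bool:
    """True when scaling p changes only its exponent, not its mantissa bits."""
    return math.frexp(p * scale)[0] == math.frexp(p)[0]

def collapse_threshold(scale: float, delta_k: float = 1.0) -> float:
    """Delta_c = 6.93 + ln S - delta_k, as given in the article."""
    return 6.93 + math.log(scale) - delta_k

# Largest power of two that keeps S * P <= 448 for P <= 1.
largest_pow2 = 2.0 ** math.floor(math.log2(E4M3_MAX))

for scale in (128.0, 256.0, 448.0):
    print(
        f"S={scale:>5.0f}  power_of_two={is_power_of_two(scale)}  "
        f"mantissa_kept={preserves_mantissa(0.3, scale)}  "
        f"Delta_c={collapse_threshold(scale):.2f}"
    )
print("largest power-of-two scale within E4M3 range:", largest_pow2)
```

Multiplying by $2^8$ only shifts the exponent, so the scaled value carries the same mantissa as the original; multiplying by 448 changes the mantissa and therefore adds a rounding step of its own. The threshold grows with $\ln S$, so each doubling of the scale raises $\Delta_c$ by $\ln 2 \approx 0.69$.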

Already Deployed, Now Explained

These two optimizations, reverse KV iteration and $S=256$, are already baked into FlashAttention-3 and FlashAttention-4. The paper's contribution is a quantitative account of why those engineering choices are good. Kernel-faithful experiments with Q, K, V in FP32 (isolating the P-cast effect) show a 3 to 10x MSE improvement at moderate sink strengths. Paired tests confirm both fixes saturate to the same precision floor when combined.

Next time you profile an FP8 attention kernel and see unexplained precision loss, check the iteration order before blaming the hardware.


Source: P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8
Domain: arxiv.org
