FP16 Randomized Sketching Matches FP32 Accuracy on GPUs

Memory bandwidth, not FLOPS, is the bottleneck for sparse sketching on GPUs. That makes mixed precision an obvious lever, since an FP16 value takes half the bytes of an FP32 one, but only if FP16 accumulation doesn't trash the embedding. The team behind the new preprint arXiv:2606.20195 shows it doesn't.

SparseStack Pushes Through the Atomic Bottleneck

CountSketch is the workhorse GPU kernel for sparse subspace embeddings, but it struggles on coherent inputs, where a few rows carry most of the signal. Enter SparseStack, a generalization that adds extra nonzeros per column to improve embedding quality on those tough inputs. More nonzeros mean more atomic-update contention, which usually kills throughput. The authors went ahead and implemented FP16 SparseStack variants anyway, using deterministic round-to-nearest, exact stochastic rounding, and dithered rounding.
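
To make the structure concrete, here is a minimal NumPy sketch. It is not the authors' GPU kernel. It assumes SparseStack is built by stacking k independent CountSketch blocks, each scaled by 1/sqrt(k), so that k = 1 recovers plain CountSketch. The function name `sparsestack_apply` and its parameters are illustrative. `np.add.at` stands in for the atomic scatter-add, and passing `dtype=np.float16` stores every partial sum in half precision.

```python
import numpy as np

def sparsestack_apply(A, d, k, rng, dtype=np.float32):
    """Apply a d x n sparse sketch with k nonzeros per column to A (n x m).

    Assumption (not taken from the preprint): the sketch is k stacked
    CountSketch blocks of d // k rows each, scaled by 1/sqrt(k).
    """
    n, m = A.shape
    assert d % k == 0, "d must be divisible by k"
    block = d // k
    scalar_type = np.dtype(dtype).type
    scale = scalar_type(1.0 / np.sqrt(k))
    A_cast = np.asarray(A).astype(dtype)
    out = np.zeros((d, m), dtype=dtype)
    for j in range(k):
        # Each input row lands in one random row of block j with a random sign.
        rows = j * block + rng.integers(0, block, size=n)
        signs = rng.choice(np.array([-1.0, 1.0], dtype=dtype), size=n)
        # Unbuffered scatter-add: the CPU analogue of atomicAdd on the GPU.
        np.add.at(out, rows, scale * signs[:, None] * A_cast)
    return out

# Sketching the identity returns the sketch matrix itself.
S32 = sparsestack_apply(np.eye(50), d=32, k=4, rng=np.random.default_rng(0))
S16 = sparsestack_apply(np.eye(50), d=32, k=4, rng=np.random.default_rng(0),
                        dtype=np.float16)
print(np.count_nonzero(S32, axis=0)[:5])  # nonzeros in the first five columns
```

Every pass of the loop touches all n input rows again, which is where the extra atomic traffic for larger k comes from.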

Rounding Rule Barely Matters

Here is the punchline: across incoherent, coherent, and adversarial test problems, all three FP16 rounding methods produced nearly identical subspace distortion and sketch-and-solve least-squares accuracy. The dominant factor is the sketch distribution itself: SparseStack variants substantially cut distortion on coherent inputs, while all methods behave alike on incoherent ones. The rounding rule is essentially irrelevant.
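
For readers who want to reproduce this kind of comparison, one common way to score a sketch is sketched below. It assumes "subspace distortion" means the largest deviation from 1 among the singular values of S Q, where Q is an orthonormal basis for the input's column space. The preprint may use a different normalization. The dense Gaussian S is only a stand-in for a real sketch, and casting the finished product to FP16 is a crude proxy for FP16 accumulation inside a kernel.

```python
import numpy as np

def subspace_distortion(SQ):
    """Largest deviation from 1 among the singular values of S @ Q.

    Assumes Q has orthonormal columns; 0 means the sketch preserves the
    subspace exactly.
    """
    s = np.linalg.svd(np.asarray(SQ, dtype=np.float64), compute_uv=False)
    return max(s.max() - 1.0, 1.0 - s.min())

rng = np.random.default_rng(0)
n, r, d = 2000, 10, 200
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))   # incoherent test subspace
S = rng.standard_normal((d, n)) / np.sqrt(d)       # Gaussian stand-in sketch

eps32 = subspace_distortion((S @ Q).astype(np.float32))
eps16 = subspace_distortion((S @ Q).astype(np.float16))
print(f"FP32 distortion: {eps32:.4f}  FP16 distortion: {eps16:.4f}")
```

Swapping in a coherent Q, for example the first r columns of the identity, is the natural next experiment, since that is where the article says the sketch distribution makes the difference.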

Deterministic Wins by Default

Since the quality is insensitive to the rounding strategy, the one with the lowest overhead wins. Deterministic round-to-nearest FP16 SparseStack delivered the best performance-accuracy tradeoff among the FP16 variants, matching FP32 accuracy while running faster. No need for complex stochastic rounding hardware or extra logic.
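
The overhead gap is easy to see in a small NumPy illustration of two of the rounding rules. This is a sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, inputs are assumed finite and within FP16 range, and dithered rounding is left out because the article does not define it. Round-to-nearest is a single cast, while stochastic rounding needs both FP16 neighbors plus a random draw per value.

```python
import numpy as np

def round_nearest_fp16(x):
    """Deterministic round-to-nearest: a plain cast."""
    return np.asarray(x, dtype=np.float32).astype(np.float16)

def round_stochastic_fp16(x, rng):
    """Round to one of the two neighboring FP16 values, unbiased on average.

    Assumes finite inputs inside the FP16 range.
    """
    x = np.asarray(x, dtype=np.float32)
    near = x.astype(np.float16)
    near32 = near.astype(np.float32)
    inf16 = np.float16(np.inf)
    # FP16 neighbors just below and just above x (equal if x is representable).
    lo = np.where(near32 <= x, near, np.nextafter(near, -inf16)).astype(np.float32)
    hi = np.where(near32 >= x, near, np.nextafter(near, inf16)).astype(np.float32)
    gap = hi - lo
    # Probability of rounding up is the fractional position of x inside the gap.
    p_up = np.divide(x - lo, gap, out=np.zeros_like(x), where=gap > 0)
    return np.where(rng.random(x.shape) < p_up, hi, lo).astype(np.float16)

rng = np.random.default_rng(0)
# 1 + 2**-12 sits a quarter of the way between the FP16 values 1 and 1 + 2**-10.
x = np.full(100_000, 1 + 2**-12, dtype=np.float32)
rn = round_nearest_fp16(x)
sr = round_stochastic_fp16(x, rng)
print("round-to-nearest mean:", rn.astype(np.float64).mean())
print("stochastic mean:      ", sr.astype(np.float64).mean())
```

Round-to-nearest sends every copy of this value to 1, while stochastic rounding is right on average at the cost of extra work per update. The article's point is that this unbiasedness did not buy measurable sketch quality.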

What That Means for Sketching on GPUs

This result flips a common assumption in randomized numerical linear algebra: low-precision accumulation can be safe for sparse sketching, as long as the sketch distribution is well designed. The next question is whether this robustness extends to other sketch families and even lower precisions like FP8. If it does, we can push sketching deeper into the GPU memory hierarchy without sacrificing quality.

Source: Randomized Sketching is Robust to Low-Precision Rounding on GPUs
Domain: arxiv.org