Mixed-Precision CA-SGD Hits 6.8x Speedup Over FP32 on A100 GPUs

Mixed-precision CA-SGD runs 5.1x to 6.8x faster than FP32 SGD on standard GLM benchmark datasets, and not because the math got smarter: the communication got out of the way.

Distributed SGD stalls on AllReduce latency, not compute. Each iteration forces a full synchronization that wastes GPU cycles while nodes wait. A team from NERSC and collaborators just published a practical fix: mixed-precision communication-avoiding SGD (CA-SGD) for generalized linear models on NVIDIA A100 GPUs.

Replacing AllReduce Chains with a Single Gram Matrix AllReduce

Standard CA-SGD amortizes communication over s iterations by replacing s consecutive AllReduces with a single AllReduce of an sb×sb Gram matrix, where b is the mini-batch size. You trade more local computation (a Gram GEMM) for fewer synchronization points. Modern GPUs with matrix math units and reduced-precision formats make that trade win: the Gram GEMM runs fast in low precision, and the AllReduce traffic shrinks with BF16.
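To make the trade concrete, here is a minimal NumPy sketch of one outer iteration for logistic regression, checked against s plain SGD steps. It is an illustration, not the authors' code: the function names, the feature-partitioned layout, and the Python sum standing in for the AllReduce are all assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_steps(X, y, w, batches, eta):
    """Plain mini-batch SGD: one gradient (and one AllReduce) per batch."""
    w = w.copy()
    for idx in batches:
        Xb, yb = X[idx], y[idx]
        r = sigmoid(Xb @ w) - yb              # residual at the current weights
        w -= (eta / len(idx)) * (Xb.T @ r)
    return w

def ca_sgd_outer(X, y, w, batches, eta, n_parts=4):
    """One CA-SGD outer iteration covering s = len(batches) SGD steps."""
    b = len(batches[0])                       # all batches assumed to have size b
    rows = np.concatenate(batches)
    Y, z = X[rows], y[rows]                   # stacked sb x n batch matrix
    # Assumed layout: each hypothetical "rank" owns a slice of the columns.
    col_parts = np.array_split(np.arange(X.shape[1]), n_parts)
    # The sum over ranks stands in for the single AllReduce of G and v.
    G = sum(Y[:, c] @ Y[:, c].T for c in col_parts)   # sb x sb Gram matrix
    v = sum(Y[:, c] @ w[c] for c in col_parts)        # margins at the old weights
    # Inner recurrence: uses only G and v, so no further communication.
    r = np.zeros(len(rows))
    for j in range(len(batches)):
        cur, prev = slice(j * b, (j + 1) * b), slice(0, j * b)
        margin = v[cur] - (eta / b) * (G[cur, prev] @ r[prev])
        r[cur] = sigmoid(margin) - z[cur]
    return w - (eta / b) * (Y.T @ r)          # apply all s updates at once

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 20))
y = (rng.random(256) < 0.5).astype(float)
w0 = 0.1 * rng.standard_normal(20)
batches = [rng.choice(256, size=8, replace=False) for _ in range(4)]  # s=4, b=8

w_sgd = sgd_steps(X, y, w0, batches, eta=0.1)
w_ca = ca_sgd_outer(X, y, w0, batches, eta=0.1)
print(np.max(np.abs(w_sgd - w_ca)))           # difference at rounding level
```

In exact arithmetic the two routes produce the same weights. The off-diagonal blocks of the Gram matrix carry the information that the s - 1 skipped synchronizations would otherwise have exchanged.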

The authors go further by mixing precisions across every step of the outer iteration. Their finite-precision analysis decomposes the local rounding error of one CA-SGD outer iteration into nine independent precision choices. Those choices depend on the hardware only through its low-precision unit roundoffs, so the recipe transfers across GPU generations - no re-tuning needed when you swap V100 for A100.

The recommended recipe: store the input matrix and margin vector in low precision, compute the Gram matrix from low-precision inputs with high-precision accumulation, communicate it in high precision, and perform inner recurrence and weight updates in high precision.
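The recipe can be emulated on a CPU with NumPy. This is a rough sketch under stated assumptions: float16 stands in for BF16 (NumPy has no native bfloat16), float32 plays the high precision, and the sizes s, b, and n are arbitrary. It shows only where each precision sits, not the authors' GPU kernels.

```python
import numpy as np

LOW, HIGH = np.float16, np.float32    # float16 is a stand-in for BF16 (assumption)

rng = np.random.default_rng(0)
s, b, n = 4, 8, 64                    # hypothetical sizes
Y = rng.standard_normal((s * b, n))   # stacked batch rows, float64 reference
w = rng.standard_normal(n).astype(HIGH)   # weights stay in high precision

Y_low = Y.astype(LOW)                 # input matrix stored in low precision
Y_up = Y_low.astype(HIGH)             # exact upcast of the already-rounded inputs
G = Y_up @ Y_up.T                     # Gram GEMM accumulated in high precision
v_low = (Y_up @ w).astype(LOW)        # margin vector stored in low precision

# G would be AllReduced here in HIGH; the inner recurrence and the weight
# update would then run in HIGH as well.
G_ref = Y @ Y.T
rel_err = np.linalg.norm(G - G_ref) / np.linalg.norm(G_ref)
print(G.dtype, v_low.dtype, rel_err)
```

The error in G relative to the float64 reference comes almost entirely from rounding the inputs once. Accumulating in float32 keeps the summation from adding low-precision error on top of it.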

Real Speedups on Logistic, Linear, and Poisson Problems

On NERSC Perlmutter A100 GPUs, mixed-precision CA-SGD matches FP32 SGD loss within 0.5% across logistic, linear, and Poisson problems. The speedup range of 5.1x to 6.8x comes from the epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets - all standard distributed GLM benchmarks.

One caveat: this is for GLMs, not deep neural networks. A GLM's margins are linear in the weights, which is what makes the Gram matrix trick possible; transformer training needs different tricks. But for any practitioner running distributed logistic regression or Poisson regression at scale, this is a drop-in replacement that cuts wall time by roughly 80% to 85%.
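The wall-time figure follows directly from the reported speedups: a speedup of k leaves 1/k of the original run time. A quick check, using a helper name chosen here for illustration:

```python
def wall_time_reduction(speedup):
    """Fraction of wall time saved by a given speedup factor."""
    return 1.0 - 1.0 / speedup

for k in (5.1, 6.8):
    print(f"{k}x -> {wall_time_reduction(k):.1%}")
# 5.1x -> 80.4%
# 6.8x -> 85.3%
```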

The code is already public on Zenodo at doi:10.5281/zenodo.20448273. If you are bottlenecked on AllReduce and working with GLMs, there is no reason not to try it tomorrow.

Source: Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs
Domain: arxiv.org
