
FP8 + Ozaki Scheme II Recovers Full FP64 on B300 at 500 TFLOPS

A new paper shows that FP8 tensor cores paired with the Chinese Remainder Theorem can emulate FP64 faster than native hardware on NVIDIA's Blackwell Ultra, more than making up for a 31x regression in native FP64 throughput.

Tags: nvidia, blackwell ultra, b300, ozaki scheme, fp8, hpc

Native FP64 on NVIDIA's Blackwell Ultra (B300) sits at a pathetic ~1.3 TFLOPS, a 31x regression from the B200. That's not a typo. With the FP64 ceiling that low, kernels that are normally memory-bound, like SpMV and GEMV, suddenly become compute-bound on a chip designed for AI tensor throughput.

Why B300's FP64 Collapse Is Actually Irrelevant

A new paper on arXiv argues that the whole premise of native FP64 silicon is obsolete. The authors propose combining FP8 tensor cores with the Ozaki Scheme II, a method based on the Chinese Remainder Theorem, to reconstruct full FP64 accuracy from multiple lower-precision multiply-accumulates. On B300, this emulated FP64 hits ~500 TFLOPS. On NVIDIA's upcoming Rubin R200, it reaches ~400 TFLOPS. Both numbers exceed B200's native FP64 ceiling (roughly 40 TFLOPS, given the 31x figure above) by about an order of magnitude in compute-bound regimes.
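The Chinese Remainder Theorem idea behind this kind of scheme can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: it starts from integer matrices (the scaling of FP64 inputs to integers is omitted), it uses Python integers rather than FP8 tensor cores, and the moduli in `MODULI` are hypothetical values picked only because they are pairwise coprime. Each modular product involves small residues only, and the exact result is recovered afterwards.

```python
from math import prod

# Hypothetical pairwise-coprime moduli, chosen for illustration only.
# Their product (about 4.1e9) bounds the magnitude of results that can
# be recovered exactly: |c| must stay below prod(MODULI) / 2.
MODULI = (251, 253, 255, 256)


def matmul_mod(a, b, m):
    """Matrix product with every entry reduced modulo m."""
    inner, cols = len(b), len(b[0])
    return [
        [sum((row[t] % m) * (b[t][j] % m) for t in range(inner)) % m
         for j in range(cols)]
        for row in a
    ]


def crt_signed(residues, moduli):
    """Recover the signed integer that matches all residues (CRT)."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    x %= big_m
    # Map from [0, big_m) to the symmetric range around zero.
    return x - big_m if x > big_m // 2 else x


def matmul_crt(a, b, moduli=MODULI):
    """Exact integer matmul assembled from independent modular matmuls."""
    per_modulus = [matmul_mod(a, b, m) for m in moduli]
    rows, cols = len(a), len(b[0])
    return [
        [crt_signed([c[i][j] for c in per_modulus], moduli)
         for j in range(cols)]
        for i in range(rows)
    ]


A = [[1234, -567], [89, 1011]]
B = [[-321, 654], [987, -210]]
C = matmul_crt(A, B)
print(C)
```

The modular products are independent of each other, which is what makes the approach a natural fit for many cheap low-precision multiply-accumulates followed by one reconstruction step.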

Tensor-Memory Equilibrium: A Better Roofline

The paper introduces the Tensor-Memory Equilibrium (TME) model, which augments the classic roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. The key insight: register-level fusion lets beta approach 1, so emulation adds almost no extra memory traffic. For memory-bound kernels, the extra tensor-core work then hides behind the memory wall, and the emulation becomes effectively free.
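A rough way to see how the three parameters interact is the sketch below. The functional form is an assumption made for illustration (this summary does not give the paper's equations): time is taken as the larger of the scaled compute time and the scaled memory time, plus gamma. The kernel size, the 8 TB/s bandwidth, the 4 PFLOPS tensor peak, and alpha = 8 are all hypothetical placeholders; only the ~1.3 TFLOPS native figure comes from the article.

```python
def tme_time(flops, bytes_moved, peak_flops, bandwidth,
             alpha=1.0, beta=1.0, gamma=0.0):
    """Assumed TME-style estimate: roofline time with multipliers.

    alpha scales the compute work, beta scales the memory traffic,
    and gamma is a fixed reconstruction latency in seconds.
    """
    compute_time = alpha * flops / peak_flops
    memory_time = beta * bytes_moved / bandwidth
    return max(compute_time, memory_time) + gamma


def bound(flops, bytes_moved, peak_flops, bandwidth, alpha=1.0, beta=1.0):
    """Report which side of the roofline limits the kernel."""
    compute_time = alpha * flops / peak_flops
    memory_time = beta * bytes_moved / bandwidth
    return "compute" if compute_time > memory_time else "memory"


# Hypothetical SpMV-like kernel: about 0.167 FLOP per byte moved.
FLOPS = 2e9
BYTES = 12e9
BANDWIDTH = 8e12          # assumed HBM bandwidth, bytes/s

NATIVE_FP64_PEAK = 1.3e12  # ~1.3 TFLOPS, from the article
TENSOR_PEAK = 4e15         # hypothetical low-precision tensor peak
ALPHA = 8.0                # hypothetical emulation compute multiplier

native = bound(FLOPS, BYTES, NATIVE_FP64_PEAK, BANDWIDTH)
emulated = bound(FLOPS, BYTES, TENSOR_PEAK, BANDWIDTH, alpha=ALPHA, beta=1.0)
print("native FP64:", native)
print("emulated FP64:", emulated)
```

Under these placeholder numbers the native path is limited by compute while the emulated path is limited by memory, which is the situation the article describes: with beta near 1, the emulated kernel runs at the speed the memory system allows.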

Every Kernel Class Hits the Memory Roof

Against an H100 baseline, the Ozaki II method matches or exceeds H100 on every workload studied, whereas B300's native FP64 imposes up to a 50x regression. The companion paper (Part 2) handles FFT via Kulisch fixed-point reconstruction on the surviving INT32 pipe and FP32+Kahan reductions. Together, they cover the canonical HPC kernel spectrum (SpMV, GEMV, stencils, FFTs), and all of these reach the memory roof at full FP64.
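The "FP32+Kahan" part refers to compensated summation, which can be illustrated with a minimal sketch. This is not the companion paper's code: it emulates FP32 in pure Python by rounding every intermediate result through `struct`, and the helper names (`f32`, `naive_sum_f32`, `kahan_sum_f32`) are made up for this example.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Plain left-to-right accumulation with FP32 rounding at each step."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def kahan_sum_f32(values):
    """Kahan compensated summation with FP32 rounding at each step."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = f32(f32(v) - comp)
        t = f32(total + y)
        # (t - total) is what was actually added; subtracting y
        # leaves the rounding error of this step.
        comp = f32(f32(t - total) - y)
        total = t
    return total


values = [0.1] * 100_000
exact = len(values) * f32(0.1)  # reference computed in double precision

naive_err = abs(naive_sum_f32(values) - exact)
kahan_err = abs(kahan_sum_f32(values) - exact)
print("naive FP32 error:", naive_err)
print("Kahan FP32 error:", kahan_err)
```

The compensation term costs a few extra additions per element, which is cheap for a reduction that is limited by memory bandwidth anyway.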

The evidence is clear: FP8 tensor cores, augmented with Ozaki II and Kulisch escape routes, eliminate the need for native FP64 hardware. Next-generation HPC procurement should budget for tensor throughput, not double-precision silicon.


Source: FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail
Domain: arxiv.org
