Blackwell Ultra B300 Matches FP64 FFT via FP8 Tensor Core Emulation

A new Ozaki-Bailey FFT method projects an 18 ms runtime for a 1024^3 transform at full FP64 on B300, about 1.4x the 12.9 ms memory-bandwidth roof, despite a 30x drop in native FP64 throughput.

NVIDIA's Blackwell Ultra (B300) delivers only 1.3 TFLOPS of FP64 vector throughput, roughly 30x less than the B200 and well below the bandwidth-parity floor that memory-bound workloads need. That would normally rule out high-performance double-precision FFT on this hardware. A new paper from the Ozaki Scheme II team argues otherwise: software emulation on FP8 tensor cores is projected to run a 1024^3 FFT in about 18 ms, against a memory roof of 12.9 ms.

The FP64 Throughput Disaster on B300

NVIDIA cut B300's native FP64 pipe to free die area for tensor-core capabilities. At ~1.3 TFLOPS, the chip sits roughly 10x below the native FP64 bandwidth-parity floor of 12.5 TFLOPS (derived from 1.56 * B_HBM, with B_HBM = 8 TB/s). Rubin, by comparison, hits 33 TFLOPS and comes within 4% of the floor the same formula sets for it. B300's deficit looks fatal for any compute-bound double-precision kernel.
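The floor arithmetic above is easy to check. The sketch below uses only the figures quoted in this article; the constant and function names are illustrative, and the units (TFLOPS per TB/s) are assumed from context.

```python
# Figures quoted in the article; names are illustrative, not from the paper.
HBM_TBPS = 8.0            # B300 HBM bandwidth, TB/s
FP64_PEAK_TFLOPS = 1.3    # B300 native FP64 vector throughput
NATIVE_MULTIPLIER = 1.56  # native FP64 floor = 1.56 * B_HBM


def native_fp64_floor(hbm_tbps: float) -> float:
    """Native FP64 bandwidth-parity floor in TFLOPS for a given HBM bandwidth."""
    return NATIVE_MULTIPLIER * hbm_tbps


floor = native_fp64_floor(HBM_TBPS)
deficit = floor / FP64_PEAK_TFLOPS
print(f"floor = {floor:.2f} TFLOPS, deficit = {deficit:.1f}x")
# prints: floor = 12.48 TFLOPS, deficit = 9.6x
```

The 12.48 TFLOPS result rounds to the 12.5 TF quoted above, and the 9.6x gap is the "roughly 10x" deficit.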

But the Ozaki Scheme II framework doesn't need native FP64. It routes dense matrix multiply through FP8 tensor cores using a mantissa-sliced Chinese-remainder reconstruction. A companion Part 1 paper covered GEMM, GEMV, stencils, and SpMV. Part 2 adds the fifth canonical primitive: the 3-D FFT.
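To make the Chinese-remainder idea concrete, here is a minimal sketch, not the paper's kernel. It assumes the inputs are already small integers (the mantissa-slicing step that turns FP64 values into such integers is omitted), uses NumPy int64 products as a stand-in for low-precision tensor-core GEMMs, and picks four arbitrary prime moduli. The reconstruction uses Garner's algorithm, and all function names are hypothetical.

```python
import numpy as np
from math import prod

MODULI = (251, 241, 239, 233)  # pairwise-coprime primes, chosen only for illustration


def matmul_mod(a, b, m):
    """Integer matrix product reduced mod m (stand-in for one low-precision GEMM)."""
    return ((a % m) @ (b % m)) % m


def garner(residues, moduli):
    """Garner's algorithm: rebuild x in [0, prod(moduli)) from its residues."""
    x, radix = 0, 1
    for r, m in zip(residues, moduli):
        # Choose digit d so that x + radix * d is congruent to r (mod m).
        d = ((r - x) * pow(radix, -1, m)) % m
        x += radix * d
        radix *= m
    return x


def crt_matmul(a, b):
    """Exact integer matmul assembled from one small-modulus product per modulus."""
    parts = [matmul_mod(a, b, m) for m in MODULI]
    big_m = prod(MODULI)
    out = np.empty(parts[0].shape, dtype=object)
    for idx in np.ndindex(out.shape):
        x = garner([int(p[idx]) for p in parts], MODULI)
        out[idx] = x - big_m if x > big_m // 2 else x  # map back to a signed range
    return out


rng = np.random.default_rng(0)
a = rng.integers(-1000, 1000, size=(8, 8))
b = rng.integers(-1000, 1000, size=(8, 8))
print(np.array_equal(crt_matmul(a, b).astype(np.int64), a @ b))
```

The result is exact only while every true output entry stays below half the product of the moduli in magnitude, so a real scheme has to size the set of moduli to the target precision.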

Ozaki-Bailey FFT: Emulating FP64 via FP8 Tensor Cores

The Ozaki-Bailey FFT factors the 3-D transform with Bailey's six-step decomposition, mapping both 1-D FFT phases to GEMMs on FP8 tensor cores. Bailey's inner factor is small, k ~ sqrt(N) (k = 32 for N = 1024), which puts the kernel in a regime where the third TME parameter, gamma (reconstruction latency), binds rather than amortizes. Garner reconstruction then splits the work in two: Phase A runs the inner products on FP8/INT8 tensor cores and completes in about 1 ms for 1024^3 on B300, while Phase B handles the per-output reduction.
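The GEMM mapping can be illustrated in plain NumPy. This sketch shows only the Bailey factorization of one length-N 1-D FFT into two dense DFT-matrix products with a twiddle multiply in between; it runs in ordinary complex128 and does not emulate FP8 tensor cores or the Garner phases. The six-step formulation's explicit transposes are folded into `reshape` and `.T`, and the function names are illustrative.

```python
import numpy as np


def dft_matrix(n):
    """Dense n x n DFT matrix, F[j, k] = exp(-2*pi*i*j*k/n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)


def bailey_fft_1d(x, n1, n2):
    """Length n1*n2 FFT computed as two GEMMs plus a twiddle multiply."""
    n = n1 * n2
    a = np.asarray(x, dtype=np.complex128).reshape(n1, n2)
    b = dft_matrix(n1) @ a                      # GEMM 1: length-n1 DFTs down the columns
    twiddle = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n2)) / n)
    b = b * twiddle                             # element-wise twiddle factors
    c = b @ dft_matrix(n2)                      # GEMM 2: length-n2 DFTs along the rows
    return c.T.reshape(-1)                      # transpose restores natural output order


x = np.random.default_rng(0).standard_normal(1024) + 0j
print(np.allclose(bailey_fft_1d(x, 32, 32), np.fft.fft(x)))
```

With N = 1024 both DFT matrices are only 32 x 32, which is the small inner dimension the paragraph above refers to. A 3-D transform would apply the same 1-D routine along each axis in turn.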

Kulisch Escape Route and Bandwidth Parity Floors

Phase B is the bottleneck. The paper identifies Kulisch fixed-point complete arithmetic as a reformulation that keeps full FP64 accuracy while running entirely on the INT32 SIMT pipe, which sidesteps B300's anemic FP64 pipe. The condition the paper sets is that a GPU must meet either the native FP64 floor or both the INT32 sub-floor (8.25 * B_HBM) and the FP8 floor (170 * B_HBM). B300 misses the first but meets both of the latter. Projected runtime for 1024^3 at full FP64 is ~18 ms, about 1.4x the 12.9 ms memory roof.
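The idea behind a Kulisch accumulator is to add values in a fixed-point register wide enough to hold any FP64 value exactly, so that rounding happens once, at the end. The sketch below illustrates this with Python's arbitrary-precision integers standing in for the multi-word INT32 accumulator a GPU kernel would use. It is not the paper's Phase B kernel, and the function names are hypothetical.

```python
import math

SHIFT = 1074  # 2**-1074 is the smallest positive FP64 subnormal


def to_fixed(x: float) -> int:
    """Exact integer n such that x == n * 2**-SHIFT, for any finite float."""
    num, den = x.as_integer_ratio()  # den is a power of two, at most 2**1074
    return num * ((1 << SHIFT) // den)


def kulisch_sum(values):
    """Sum floats exactly in fixed point, rounding to FP64 only once at the end."""
    acc = 0
    for v in values:
        acc += to_fixed(v)           # pure integer addition, no rounding
    return acc / (1 << SHIFT)        # single rounding step


data = [1e16, 1.0, -1e16]
print(sum(data))          # 0.0: the 1.0 is lost when added to 1e16
print(kulisch_sum(data))  # 1.0: the exact result
```

Every step inside the loop is integer addition, which is why a reduction of this kind can in principle run on an integer pipe instead of the FP64 units.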

In short, a GPU reaches memory-roof FFT parity if it satisfies either the native floor or both Kulisch floors, and B300 passes the second test. If the projection holds in practice, Blackwell Ultra becomes viable for full-FP64 FFT through software alone. The authors call for a libKulisch library and a benchmark campaign; if those materialize, the technique could trickle into HPC and scientific computing stacks.
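The either/or test and the memory roof can be written down directly. The multipliers below are the ones quoted in this article. The memory-roof accounting (three passes over the array, each reading and writing every 16-byte complex FP64 element) is an assumption on my part that happens to reproduce the 12.9 ms figure; the paper may derive it differently. Function names and throughput units (tera-operations per second) are also assumptions.

```python
def parity_floors(hbm_tbps):
    """Bandwidth-parity floors in tera-ops/s, using the article's multipliers."""
    return {
        "fp64": 1.56 * hbm_tbps,    # native FP64 floor
        "int32": 8.25 * hbm_tbps,   # Kulisch INT32 sub-floor
        "fp8": 170.0 * hbm_tbps,    # FP8 tensor-core floor
    }


def meets_parity(fp64, int32, fp8, hbm_tbps):
    """True if the native floor is met, or both Kulisch-route floors are met."""
    f = parity_floors(hbm_tbps)
    native = fp64 >= f["fp64"]
    kulisch = int32 >= f["int32"] and fp8 >= f["fp8"]
    return native or kulisch


def memory_roof_ms(n, hbm_tbps, passes=3, bytes_per_elem=16):
    """Time in ms to stream an n^3 array `passes` times, read plus write (assumed model)."""
    traffic_bytes = passes * 2 * bytes_per_elem * n**3
    return traffic_bytes / (hbm_tbps * 1e12) * 1e3


print(parity_floors(8.0))
print(round(memory_roof_ms(1024, 8.0), 1))  # 12.9
```

At 8 TB/s the floors come out to 12.48, 66, and 1360 tera-ops/s, so the Kulisch route trades one FP64 requirement that B300 cannot meet for two requirements on its integer and FP8 pipes.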


Source: FP8 is All You Need (Part 2): Efficient Ozaki-Bailey Style FFT Through Tensor-core Garner Reformulation and Kulisch Escape Route
Domain: arxiv.org
