EmuGEMM Fuses Tensor Core Kernels to Beat cuBLAS by 1.7x on Blackwell

EmuGEMM hits 3,654 Top/s on Blackwell GPUs by fusing integer Tensor Core kernels into a single pass, slashing the memory bottleneck that plagues prior precision-emulation schemes.

How EmuGEMM Kills the Memory Bottleneck

Standard Ozaki Scheme I and II implementations reconstruct high-precision GEMM from low-precision Tensor Core operations, but they repeatedly write intermediate results to global memory and read them back. That data movement dominates runtime. EmuGEMM eliminates those round-trips by fusing the entire chain into a single integer kernel on NVIDIA Hopper and Blackwell architectures.
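To make the idea concrete, here is a minimal pure-Python sketch of slice-based emulation in the spirit of Ozaki Scheme I. It is an illustration under stated assumptions, not EmuGEMM's kernel: the names split_rows and emulated_matmul, the 6-bit slice width, and the nine-slice default are all hypothetical choices. The loop over slice pairs stands in for the separate low-precision GEMMs whose intermediates a fused kernel would keep on chip.

```python
import math
import random


def split_rows(M, num_slices, bits):
    """Split each row of M into integer slices sharing one power-of-two scale.

    Returns (scales, slices) such that
    M[i][j] ~= scales[i] * sum_k slices[k][i][j] * 2**(-bits * (k + 1)).
    Hypothetical helper for illustration only.
    """
    scales = []
    slices = [[] for _ in range(num_slices)]
    for row in M:
        peak = max(abs(x) for x in row)
        # Power-of-two scale chosen so every scaled entry has magnitude < 1.
        scale = 2.0 ** math.frexp(peak)[1] if peak > 0 else 1.0
        resid = [x / scale for x in row]
        for k in range(num_slices):
            digits = []
            for j, r in enumerate(resid):
                shifted = r * 2 ** bits   # scaling by a power of two
                d = math.trunc(shifted)   # |d| <= 2**bits - 1, fits in int8 for bits=6
                digits.append(d)
                resid[j] = shifted - d    # remainder, magnitude < 1
            slices[k].append(digits)
        scales.append(scale)
    return scales, slices


def emulated_matmul(A, B, num_slices=9, bits=6):
    """Approximate A @ B using only small-integer dot products plus rescaling."""
    n, inner, m = len(A), len(B), len(B[0])
    Bt = [[B[p][j] for p in range(inner)] for j in range(m)]  # columns of B as rows
    sa, As = split_rows(A, num_slices, bits)
    sb, Bs = split_rows(Bt, num_slices, bits)
    C = [[0.0] * m for _ in range(n)]
    for k in range(num_slices):
        for l in range(num_slices):
            weight = 2.0 ** (-bits * (k + l + 2))
            for i in range(n):
                for j in range(m):
                    # Exact integer dot product: the role an INT8 Tensor Core
                    # instruction with integer accumulation would play.
                    dot = sum(a * b for a, b in zip(As[k][i], Bs[l][j]))
                    C[i][j] += weight * dot
    # Undo the per-row and per-column scaling.
    return [[C[i][j] * sa[i] * sb[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    random.seed(0)
    A = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    B = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
    print(emulated_matmul(A, B))
```

In this sketch every slice-pair product is a separate pass. An unfused GPU implementation would materialize each of those integer results in global memory before combining them, which is the traffic the fused kernel is described as avoiding.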

Schemes that used to spend most of their cycles waiting on DRAM now stream data through registers and shared memory. The paper reports EmuGEMM sustains 1,639 Top/s on Hopper (83% of INT8 peak) and 3,654 Top/s on Blackwell (81% of INT8 peak). Those are efficiency numbers most kernel authors would kill for.
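As a quick sanity check on those figures, the sustained rate and the quoted fraction of peak together imply the INT8 peak being assumed. The short calculation below does that arithmetic. The function name implied_peak is hypothetical, and the results are back-derived from rounded percentages, so treat them as approximate rather than as vendor specifications.

```python
def implied_peak(sustained_tops, fraction_of_peak):
    """Peak rate implied by a sustained rate and its stated fraction of peak."""
    return sustained_tops / fraction_of_peak


# Figures quoted in the article: (sustained Top/s, fraction of INT8 peak).
reported = {
    "Hopper": (1639, 0.83),
    "Blackwell": (3654, 0.81),
}

for gpu, (sustained, fraction) in reported.items():
    print(f"{gpu}: implied INT8 peak of about {implied_peak(sustained, fraction):.0f} Top/s")

# Generation-over-generation gain in sustained throughput.
print(f"Blackwell / Hopper sustained ratio: {3654 / 1639:.2f}x")
```

The implied peaks come out near 1,975 Top/s for Hopper and 4,511 Top/s for Blackwell, and the sustained rate on Blackwell is a little over twice the Hopper figure.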

Performance Numbers That Matter

For large matrices, EmuGEMM outperforms cuBLAS TF32 by up to 1.4x on Hopper and 1.7x on Blackwell, at comparable accuracy. In other words, the extra throughput does not come at the cost of precision, a rare combination in scientific computing.

Scheme II extends the fusion to complex arithmetic. There EmuGEMM beats cuBLAS ZGEMM by 2.3x on Hopper and 5.5x on Blackwell. Complex matrix multiplies are the backbone of quantum chemistry and CFD codes; a 5.5x speedup on Blackwell is the kind of number that makes simulation groups rewrite their solvers.
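The article does not spell out how the complex case is decomposed. As background only, one textbook route is to build a complex matrix product from four real products, each of which could in principle be handed to an emulated real GEMM. The sketch below shows that decomposition in plain Python. The function names are hypothetical, and this is not claimed to be the method EmuGEMM uses.

```python
def real_matmul(X, Y):
    """Plain real matrix product on nested lists."""
    inner = len(Y)
    return [
        [sum(X[i][p] * Y[p][j] for p in range(inner)) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def complex_matmul_via_real(A, B):
    """Complex product from four real products:
    (Ar + i*Ai)(Br + i*Bi) = (Ar*Br - Ai*Bi) + i*(Ar*Bi + Ai*Br).
    """
    Ar = [[z.real for z in row] for row in A]
    Ai = [[z.imag for z in row] for row in A]
    Br = [[z.real for z in row] for row in B]
    Bi = [[z.imag for z in row] for row in B]
    RR, II = real_matmul(Ar, Br), real_matmul(Ai, Bi)
    RI, IR = real_matmul(Ar, Bi), real_matmul(Ai, Br)
    return [
        [complex(RR[i][j] - II[i][j], RI[i][j] + IR[i][j]) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


if __name__ == "__main__":
    A = [[1 + 2j, 3 - 1j], [-2 + 0j, 4j]]
    B = [[2 - 1j, 1j], [1 + 1j, -3 + 2j]]
    print(complex_matmul_via_real(A, B))
```

Because the work reduces to real products, any speedup in the underlying real kernel carries over to the complex case, which is one plausible reason the reported gap against ZGEMM is so large.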

What This Enables Next

EmuGEMM is not a theoretical toy: it runs on shipping hardware (Hopper H100 and Blackwell B200), and the paper's level of detail suggests the code may be open-sourced. The approach shows that fusing precision-emulation steps into a single Tensor Core kernel is viable and efficient, closing the gap between low-precision throughput and double-precision accuracy.

EmuGEMM is a strong candidate to become the default drop-in for high-precision linear algebra on NVIDIA GPUs, especially for workloads where cuBLAS double-precision is the bottleneck and TF32 accuracy isn't enough.

Source: EmuGEMM: Fused Tensor Core Kernels for Precision Emulation in Matrix Multiplication
Domain: arxiv.org
