On an NVIDIA A100, the CUDA backend runs FFTs more than 4x faster than multithreaded FFTW on a 64-thread AMD EPYC Zen2 processor. The trick: simulate a quantum Fourier transform (QFT) circuit on classical hardware.
QFT Circuits as FFT Drop-Ins
The new library, called QFT→FFT, maps input arrays directly to state amplitudes of a quantum computer simulator. Normalization and indexing are handled explicitly, so the QFT circuit becomes a drop-in replacement for traditional FFT primitives. A backend-agnostic planner builds a fused-gate schedule and memory layout adapters that boost arithmetic intensity and cut data movement.
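The mapping from an input array to state amplitudes can be illustrated with a small pure-Python sketch. It is not the library's code: the function names `qft_amplitudes` and `fft_via_qft` are hypothetical, the circuit is the textbook QFT (a Hadamard on each qubit, controlled phase rotations, then a final qubit-order reversal), and qubit 0 is assumed to be the most significant bit of the array index. The sketch makes the normalization and sign conventions explicit: the QFT is unitary and uses a positive exponent, so recovering a conventional unnormalized forward DFT takes a conjugation on the way in and out plus a factor of sqrt(N).

```python
import cmath
import math


def qft_amplitudes(x):
    """Apply the textbook QFT circuit to a list of 2**n amplitudes.

    Assumed convention: qubit 0 is the most significant bit of the index.
    Returns y[k] = (1/sqrt(N)) * sum_j x[j] * exp(+2*pi*i*j*k/N).
    """
    N = len(x)
    n = N.bit_length() - 1
    if N < 2 or N != 1 << n:
        raise ValueError("length must be a power of two, at least 2")
    state = [complex(v) for v in x]
    s = 1 / math.sqrt(2)
    for q in range(n):
        mask_q = 1 << (n - 1 - q)
        # Hadamard on qubit q: butterfly over index pairs differing in that bit.
        for i in range(N):
            if not i & mask_q:
                a, b = state[i], state[i | mask_q]
                state[i] = s * (a + b)
                state[i | mask_q] = s * (a - b)
        # Controlled phase rotations: control qubit c, target qubit q.
        for c in range(q + 1, n):
            mask_c = 1 << (n - 1 - c)
            phase = cmath.exp(2j * math.pi / 2 ** (c - q + 1))
            for i in range(N):
                if i & mask_q and i & mask_c:
                    state[i] *= phase
    # Final qubit-order reversal, done here as a bit-reversal permutation.
    out = [0j] * N
    for i in range(N):
        out[int(format(i, f"0{n}b")[::-1], 2)] = state[i]
    return out


def fft_via_qft(x):
    """Unnormalized forward DFT (negative exponent) built from the QFT."""
    N = len(x)
    y = qft_amplitudes([complex(v).conjugate() for v in x])
    return [math.sqrt(N) * v.conjugate() for v in y]
```

Under these assumptions, `fft_via_qft(x)` agrees with a direct evaluation of the DFT sum up to floating-point error. Each gate here sweeps the whole state vector separately, which is the kind of data movement a fused schedule is meant to reduce.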
The implementation is built on qsim, Google's C++ quantum circuit simulator, and currently supports OpenMP, AVX, and CUDA backends. On an AMD EPYC Zen2, the AVX backend matches multithreaded FFTW performance at 64 threads, which is already respectable. But the real win is on GPU hardware.
CUDA Crushes the Baseline
At larger transform sizes, the A100 CUDA backend cuts wall-clock time by a factor of 4 compared to the best AVX or FFTW runs on the EPYC CPU. That’s a direct consequence of fusing quantum gates into a schedule that maps well to GPU warp execution and shared memory.
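To make the idea of gate fusion concrete, here is a minimal sketch and not the paper's planner. In the textbook QFT circuit, all controlled phase rotations that target the same qubit are diagonal, so they can be combined into one pass over the state vector instead of one pass per gate. The function names are hypothetical, and qubit 0 is assumed to be the most significant bit of the index.

```python
import cmath
import math


def apply_phases_sequential(state, n, q):
    """Apply each controlled rotation on target q in its own full sweep."""
    out = list(state)
    mask_q = 1 << (n - 1 - q)
    for c in range(q + 1, n):
        mask_c = 1 << (n - 1 - c)
        phase = cmath.exp(2j * math.pi / 2 ** (c - q + 1))
        for i in range(len(out)):
            if i & mask_q and i & mask_c:
                out[i] *= phase
    return out


def apply_phases_fused(state, n, q):
    """Apply the same rotations in a single sweep.

    For each index, the angles of all active controls are summed first,
    so every amplitude is read and written once.
    """
    out = list(state)
    mask_q = 1 << (n - 1 - q)
    for i in range(len(out)):
        if i & mask_q:
            angle = 0.0
            for c in range(q + 1, n):
                if i & (1 << (n - 1 - c)):
                    angle += 2 * math.pi / 2 ** (c - q + 1)
            out[i] *= cmath.exp(1j * angle)
    return out
```

Both functions produce the same amplitudes, but the fused version touches memory once per target qubit. That trades a little extra arithmetic per element for fewer sweeps, which is the direction that helps when memory bandwidth is the limit.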
The paper also introduces an approximate QFT (AQFT) variant that truncates small-angle controlled rotations beyond a cutoff $k$. This reduces circuit depth and runtime while keeping the result close to the exact transform, which is useful when you don’t need full double-precision FFT output.
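The effect of the cutoff can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the paper's implementation: the function name `aqft_amplitudes` is hypothetical, and the cutoff is interpreted here as dropping every controlled rotation whose angle is smaller than 2π/2^k, which may differ from the paper's exact definition of $k$. The function also returns how many rotations were applied, so the saving in gate count is visible.

```python
import cmath
import math


def aqft_amplitudes(x, cutoff):
    """Textbook QFT circuit with small-angle rotations dropped.

    A rotation by 2*pi/2**m is applied only when m <= cutoff (assumed
    interpretation of the cutoff). Qubit 0 is the most significant bit.
    Returns (amplitudes, number_of_rotations_applied).
    """
    N = len(x)
    n = N.bit_length() - 1
    if N < 2 or N != 1 << n:
        raise ValueError("length must be a power of two, at least 2")
    state = [complex(v) for v in x]
    s = 1 / math.sqrt(2)
    rotations = 0
    for q in range(n):
        mask_q = 1 << (n - 1 - q)
        # Hadamard on qubit q.
        for i in range(N):
            if not i & mask_q:
                a, b = state[i], state[i | mask_q]
                state[i] = s * (a + b)
                state[i | mask_q] = s * (a - b)
        for c in range(q + 1, n):
            m = c - q + 1
            if m > cutoff:
                continue  # truncated: angle 2*pi/2**m is below the cutoff
            mask_c = 1 << (n - 1 - c)
            phase = cmath.exp(2j * math.pi / 2 ** m)
            for i in range(N):
                if i & mask_q and i & mask_c:
                    state[i] *= phase
            rotations += 1
    # Final qubit-order reversal as a bit-reversal permutation.
    out = [0j] * N
    for i in range(N):
        out[int(format(i, f"0{n}b")[::-1], 2)] = state[i]
    return out, rotations
```

With `cutoff` equal to the number of qubits, nothing is dropped and the result is the exact QFT. Smaller values apply fewer rotations, and the output deviates from the exact transform by an amount that depends on the input and the cutoff.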
What This Enables
QFT→FFT is a reminder that classical simulation of quantum circuits isn’t just a toy for verification: it can outrun mature HPC libraries on conventional hardware. Expect this approach to scale to larger FFT sizes and influence how future FFT libraries are designed, especially on GPU-heavy clusters where memory bandwidth is the bottleneck.
Source: Not Your Usual FFT: QFT$\rightarrow$FFT via Classical Quantum-Circuit Simulation
Domain: arxiv.org