HPC FFT Library Beats FFTW by 4x Using Quantum Circuit Simulation

QFT→FFT maps input arrays to quantum state amplitudes and runs on Google's qsim, matching FFTW with its AVX backend and beating it by 4x on an A100.

qft to fft, google qsim, fftw, nvidia a100, amd epyc zen2, quantum circuit simulation

On an NVIDIA A100, a new library's CUDA backend runs FFTs more than 4x faster than multithreaded FFTW running on a 64-thread AMD EPYC Zen2 processor. The trick: simulate a quantum Fourier transform (QFT) circuit on classical hardware.

QFT Circuits as FFT Drop-Ins

The new library, called QFT→FFT, maps input arrays directly to state amplitudes of a quantum computer simulator. Normalization and indexing are handled explicitly, so the QFT circuit becomes a drop-in replacement for traditional FFT primitives. A backend-agnostic planner builds a fused-gate schedule and memory layout adapters that boost arithmetic intensity and cut data movement.
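To make the mapping concrete, here is a minimal NumPy sketch. It is not the library's code and does not use qsim's API. It loads an array into a state vector, applies the textbook QFT circuit (Hadamards, controlled phase rotations, and a final qubit-order reversal), then undoes the QFT's sign and normalization conventions to recover a standard forward FFT. The function names `qft_statevector` and `fft_via_qft` are hypothetical, and the convention assumed is the textbook one, $|j\rangle \mapsto \frac{1}{\sqrt{N}}\sum_k e^{2\pi i jk/N}|k\rangle$, which may differ from the paper's.

```python
import numpy as np

def qft_statevector(x):
    """Apply the textbook QFT circuit to amplitudes x (length must be 2**n).

    Assumed convention: axis 0 of the reshaped state is the most
    significant bit of the array index.
    """
    x = np.asarray(x, dtype=np.complex128)
    n = x.size.bit_length() - 1
    if x.size != 2 ** n:
        raise ValueError("length must be a power of two")
    state = x.reshape((2,) * n).copy()
    hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    for q in range(n):
        # Hadamard on qubit q: contract the gate with axis q, restore axis order.
        state = np.moveaxis(np.tensordot(hadamard, state, axes=([1], [q])), 0, q)
        # Controlled rotations R_m = diag(1, exp(2*pi*i / 2**m)) onto qubit q.
        for c in range(q + 1, n):
            m = c - q + 1
            idx = [slice(None)] * n
            idx[q] = 1
            idx[c] = 1
            state[tuple(idx)] *= np.exp(2j * np.pi / 2 ** m)

    # The circuit's final swaps reverse the qubit order (bit reversal).
    return state.transpose(tuple(range(n - 1, -1, -1))).reshape(-1)

def fft_via_qft(x):
    """Forward DFT (NumPy's sign and scaling) built from the QFT above."""
    x = np.asarray(x, dtype=np.complex128)
    # The QFT uses exp(+2*pi*i*j*k/N) with a 1/sqrt(N) factor, so conjugate
    # on the way in and out, and rescale by sqrt(N).
    return np.sqrt(x.size) * np.conj(qft_statevector(np.conj(x)))

rng = np.random.default_rng(0)
signal = rng.standard_normal(16) + 1j * rng.standard_normal(16)
matches_fft = np.allclose(fft_via_qft(signal), np.fft.fft(signal))
print("matches numpy.fft.fft:", matches_fft)
```

The conjugation and the `sqrt(N)` factor are the explicit normalization step in this sketch, and the final transpose is the indexing step. The simulation is linear, so the input does not have to be a unit-norm state here; a real simulator backend may require rescaling first.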

The implementation is built on qsim, Google's C++ quantum circuit simulator, and currently supports OpenMP, AVX, and CUDA backends. On an AMD EPYC Zen2, the AVX backend matches multithreaded FFTW performance at 64 threads - already respectable. But the real win is on GPU hardware.

CUDA Crushes the Baseline

At larger transform sizes, the A100 CUDA backend cuts wall-clock time by a factor of 4 compared to the best AVX or FFTW runs on the EPYC CPU. That’s a direct consequence of fusing quantum gates into a schedule that maps well to GPU warp execution and shared memory.
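Gate fusion in general can be illustrated with a small NumPy sketch. This is a generic example and not the library's planner or qsim's fuser: a Hadamard followed by a controlled phase on the same qubit pair is multiplied into a single 4x4 gate, so the state vector is traversed once instead of twice. The helper names `apply_1q` and `apply_2q`, and the choice of qubits, are hypothetical.

```python
import numpy as np

def apply_1q(psi, gate, q):
    """Apply a 2x2 gate to qubit q of a state stored as a (2,)*n tensor."""
    return np.moveaxis(np.tensordot(gate, psi, axes=([1], [q])), 0, q)

def apply_2q(psi, gate4, a, b):
    """Apply a 4x4 gate to qubits (a, b); a is the high bit of the gate index."""
    g = gate4.reshape(2, 2, 2, 2)  # indices: (out_a, out_b, in_a, in_b)
    out = np.tensordot(g, psi, axes=([2, 3], [a, b]))
    return np.moveaxis(out, [0, 1], [a, b])

n, a, b = 10, 2, 5
rng = np.random.default_rng(0)
psi = rng.standard_normal((2,) * n) + 1j * rng.standard_normal((2,) * n)
psi /= np.linalg.norm(psi)

hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
cphase = np.diag([1, 1, 1, np.exp(2j * np.pi / 4)])  # phase when both bits are 1

# Unfused: two separate passes over all 2**n amplitudes.
unfused = apply_2q(apply_1q(psi, hadamard, a), cphase, a, b)

# Fused: multiply the small matrices first, then make a single pass.
fused_gate = cphase @ np.kron(hadamard, np.eye(2))
fused = apply_2q(psi, fused_gate, a, b)

max_diff = np.max(np.abs(fused - unfused))
print("fused and unfused results agree:", max_diff < 1e-12)
```

The fused version does more arithmetic per amplitude it loads, which is the sense in which fusion raises arithmetic intensity and reduces data movement. How many gates to fuse, and how the result is laid out for a GPU, are decisions for the planner that this sketch does not attempt to model.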

The paper also introduces an approximate QFT (AQFT) variant that drops the smallest-angle controlled rotations, those beyond a cutoff $k$. This reduces circuit depth and runtime while keeping the approximation error small - useful when you don’t need full double-precision FFT output.
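The trade-off can be explored with a small, self-contained NumPy sketch. It assumes the common textbook definition of the AQFT: the controlled rotation $R_m = \mathrm{diag}(1, e^{2\pi i/2^m})$ is kept only when $m \le k$. The paper's exact cutoff rule may differ, and the function name `aqft` is hypothetical. The script reports how many rotations survive and how far the output drifts from the exact transform.

```python
import numpy as np

def aqft(x, k=None):
    """Textbook QFT on amplitudes x, skipping rotations R_m with m > k.

    k=None keeps every rotation (the exact QFT). Returns the output
    amplitudes and the number of controlled rotations applied.
    """
    x = np.asarray(x, dtype=np.complex128)
    n = x.size.bit_length() - 1
    if x.size != 2 ** n:
        raise ValueError("length must be a power of two")
    state = x.reshape((2,) * n).copy()
    hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    kept = 0
    for q in range(n):
        state = np.moveaxis(np.tensordot(hadamard, state, axes=([1], [q])), 0, q)
        for c in range(q + 1, n):
            m = c - q + 1
            if k is not None and m > k:
                break  # every later rotation has an even smaller angle
            idx = [slice(None)] * n
            idx[q] = 1
            idx[c] = 1
            state[tuple(idx)] *= np.exp(2j * np.pi / 2 ** m)
            kept += 1
    return state.transpose(tuple(range(n - 1, -1, -1))).reshape(-1), kept

rng = np.random.default_rng(0)
x = rng.standard_normal(1024) + 1j * rng.standard_normal(1024)
# The textbook QFT equals NumPy's inverse FFT with orthonormal scaling.
exact = np.fft.ifft(x, norm="ortho")

for k in (2, 4, 6, 8, 10):
    approx, kept = aqft(x, k)
    err = np.max(np.abs(approx - exact)) / np.max(np.abs(exact))
    print(f"k={k:2d}  rotations kept={kept:2d}  max relative error={err:.2e}")
```

For $n$ qubits the exact circuit uses $n(n-1)/2$ controlled rotations, while the truncated one grows roughly as $n(k-1)$, which is where the depth and runtime savings come from.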

What This Enables

QFT→FFT is a reminder that classical simulation of quantum circuits isn’t just a toy for verification - per the reported results, it can match a mature HPC library on CPUs and beat that CPU baseline by 4x on a GPU. The approach may scale to larger FFT sizes and influence how future FFT libraries are designed, especially on GPU-heavy clusters where memory bandwidth is the bottleneck.


Source: Not Your Usual FFT: QFT→FFT via Classical Quantum-Circuit Simulation
Domain: arxiv.org
