Flash-GMM cuts ANN search distance computations by up to 1.7× with a single GPU kernel

Gaussian Mixture Models just became a practical drop-in replacement for k-means in billion-scale ANN pipelines, thanks to a single fused Triton kernel that dodges the memory wall.

The core insight is embarrassingly simple once you say it: don't materialize the full N×K responsibility matrix. Flash-GMM computes GMM responsibilities on the fly within a fused kernel, which means the GPU memory footprint no longer scales with the product of dataset size and cluster count. The result is a 20× speedup over existing implementations and the ability to train on datasets 100× larger than what you could fit on one device before.
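To get a feel for why the N×K matrix is the bottleneck, here is a back-of-the-envelope calculation. The workload numbers (100 million points, 4096 clusters, 128 dimensions, 4-byte floats) are made up for illustration, and the per-cluster accumulator layout assumes a diagonal covariance, which the source does not specify.

```python
def responsibility_matrix_bytes(n_points: int, n_clusters: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store a dense N x K responsibility matrix."""
    return n_points * n_clusters * bytes_per_value


def streaming_state_bytes(n_clusters: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes for per-cluster accumulators when responsibilities are never stored.

    Assumes a diagonal covariance: one weight sum, `dim` weighted sums of x,
    and `dim` weighted sums of x^2 per cluster.
    """
    return n_clusters * (1 + 2 * dim) * bytes_per_value


# Hypothetical workload, not taken from the paper.
n, k, d = 100_000_000, 4096, 128
dense = responsibility_matrix_bytes(n, k)
state = streaming_state_bytes(k, d)

print(f"dense responsibilities: {dense / 1e12:.2f} TB")   # 1.64 TB
print(f"streaming accumulators: {state / 1e6:.2f} MB")    # 4.21 MB
```

The dense matrix grows with N×K, while the accumulator state grows only with K×d, so adding more data points costs time but not memory.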

Where the Real Numbers Live

I care about recall, not just flops. The authors dropped Flash-GMM into the IVF coarse quantizer of an approximate nearest-neighbor search pipeline — the part that normally runs k-means to partition the vector space. Soft GMM clustering with Flash-GMM hits the same recall targets using up to 1.7× fewer distance computations. If you prefer to keep your compute budget fixed, that translates to +2 to +12 recall@10 over the k-means baseline, depending on the dataset.
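The source does not spell out how the soft assignments are consumed at query time. One plausible reading, sketched below as an assumption, is that the coarse quantizer ranks cells by GMM posterior instead of by plain centroid distance. The function names, the diagonal covariance, and the toy numbers are all hypothetical.

```python
import math


def cell_scores(query, weights, means, variances):
    """Per-cell score: log(pi_k) + log N(query | mu_k, diag(var_k)).

    This is the log posterior up to a constant shared by all cells,
    so it ranks cells in the same order as the responsibilities.
    """
    scores = []
    for w, mean, var in zip(weights, means, variances):
        log_pdf = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (q - m) ** 2 / v
            for q, m, v in zip(query, mean, var)
        )
        scores.append(math.log(w) + log_pdf)
    return scores


def cells_to_probe(query, weights, means, variances, nprobe):
    """Indices of the nprobe cells with the highest posterior for this query."""
    scores = cell_scores(query, weights, means, variances)
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order[:nprobe]


def nearest_centroid(query, means):
    """k-means style choice: the cell whose centroid is closest to the query."""
    dists = [sum((q - m) ** 2 for q, m in zip(query, mean)) for mean in means]
    return dists.index(min(dists))


# Toy 1-D example: a tight cell at 0.0 and a wide cell at 3.0.
weights = [0.5, 0.5]
means = [[0.0], [3.0]]
variances = [[0.01], [4.0]]
query = [1.0]

hard_choice = nearest_centroid(query, means)
soft_choice = cells_to_probe(query, weights, means, variances, nprobe=1)
print(hard_choice, soft_choice)
```

In this toy case the nearest centroid is the tight cell, but the query sits 10 standard deviations away from it, so the posterior favors the wide cell. That kind of disagreement is where a soft quantizer could spend its probes differently from k-means.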

What Makes It Work

It's a single GPU pass written in Triton — not a dozen CUDA kernels orchestrated by a Python loop. By fusing the E-step (responsibility computation) and the M-step (parameter updates) into one kernel, Flash-GMM avoids the intermediate write-back of an O(NK) matrix that would otherwise kill memory and stall the pipeline. That memory efficiency is the entire trick, and it's what lets you scale to hundreds of millions of points without distributing across a cluster.
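The fusion idea can be sketched on the CPU in plain Python. This is not the authors' Triton kernel, just a minimal illustration under stated assumptions: a diagonal-covariance GMM, one point processed at a time, and hypothetical function names. Each point's responsibilities are computed, folded into the M-step accumulators, and then discarded, so at most one row of the N×K matrix exists at any moment.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, weights, means, variances):
    """Posterior p(cluster | x) for one point, using a stable log-sum-exp."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def em_step_streaming(points, weights, means, variances, min_var=1e-6):
    """One EM iteration that keeps only O(K*d) state, never an N x K matrix."""
    k, d = len(weights), len(means[0])
    r_sum = [0.0] * k                        # sum over points of r_nk
    x_sum = [[0.0] * d for _ in range(k)]    # sum over points of r_nk * x_n
    x2_sum = [[0.0] * d for _ in range(k)]   # sum over points of r_nk * x_n^2
    n = 0
    for x in points:                         # any iterable works, including a stream
        n += 1
        r = responsibilities(x, weights, means, variances)  # E-step for one point
        for j in range(k):                   # fold directly into M-step sums
            r_sum[j] += r[j]
            for i in range(d):
                x_sum[j][i] += r[j] * x[i]
                x2_sum[j][i] += r[j] * x[i] * x[i]

    new_weights = [s / n for s in r_sum]
    new_means, new_vars = [], []
    for j in range(k):
        denom = max(r_sum[j], 1e-12)         # guard against an empty cluster
        mean = [x_sum[j][i] / denom for i in range(d)]
        var = [max(x2_sum[j][i] / denom - mean[i] ** 2, min_var) for i in range(d)]
        new_means.append(mean)
        new_vars.append(var)
    return new_weights, new_means, new_vars


# Tiny usage example with two well-separated blobs.
points = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [4.8, 5.2]]
w, m, v = em_step_streaming(
    iter(points),
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [5.0, 5.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(w, m)
```

A GPU version would process tiles of points per thread block and combine the per-cluster sums with atomics or a reduction, but the bookkeeping is the same: the accumulators are the only state that outlives a tile.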

The authors have released the kernel as open source. If you've ever wanted to use soft clustering for ANN indexing but stopped because k-means was the only thing that fit in memory, your excuse just evaporated.

Source: Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering
Domain: arxiv.org
