ANNS-AMP Cuts Nearest Neighbor Search Energy by 1100x with Adaptive Precision

Q: What is the significance of ANNS-AMP cutting nearest neighbor search energy by 1100x with adaptive precision?

Adaptive mixed-precision computing for approximate nearest neighbor search delivers a 163x speedup over a CPU and an 1100x reduction in energy while keeping accuracy loss within 2.7%.

An 1100x energy reduction on nearest neighbor search isn't a typo: it's what ANNS-AMP reports by adapting arithmetic precision on the fly. The framework targets a dominant bottleneck in modern LLM and recommendation pipelines: computing distances between a query and millions of high-dimensional vectors, most of which are irrelevant. Traditional accelerators burn fixed-precision cycles on every comparison. ANNS-AMP instead asks which vectors deserve full 32-bit attention and which can be graded with a handful of bits.
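To make the "handful of bits" idea concrete, here is a minimal Python sketch of a squared L2 distance computed on quantized integer codes. It is an illustration only, not the paper's datapath: the uniform quantizer, the function names `quantize` and `sq_l2_quantized`, and the value range `lo`/`hi` are all assumptions introduced here.

```python
def quantize(vec, bits, lo=0.0, hi=255.0):
    """Map each component to an unsigned integer code of the given bit width.

    Uniform quantization over an assumed value range [lo, hi].
    """
    levels = (1 << bits) - 1
    return [round((x - lo) / (hi - lo) * levels) for x in vec]


def sq_l2_quantized(query, vec, bits, lo=0.0, hi=255.0):
    """Squared L2 distance on quantized codes, rescaled to the original units."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels  # width of one quantization step
    q_codes = quantize(query, bits, lo, hi)
    v_codes = quantize(vec, bits, lo, hi)
    return sum((a - b) ** 2 for a, b in zip(q_codes, v_codes)) * step * step


# Hypothetical data: one candidate near the query, one far from it.
query = [10, 200, 30, 40]
near = [12, 198, 33, 41]
far = [240, 5, 220, 250]

for bits in (8, 4, 2):
    d_near = sq_l2_quantized(query, near, bits)
    d_far = sq_l2_quantized(query, far, bits)
    print(bits, d_near, d_far)
```

In this toy case the far candidate stays clearly far even at 2 bits, while the near candidate loses all resolution. That is the intuition behind spending bits only where the ranking is contested.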

How ANNS-AMP Chooses Precision Per Cluster

The key structural insight: vector space isn't uniform. Clusters closer to the query in PQ (product quantization) space need finer resolution to preserve top-k ordering; far-away clusters can tolerate coarser arithmetic. ANNS-AMP introduces a lightweight runtime predictor that examines three per-cluster features (scale, radius, and query distance) to decide a precision level at inference time. The predictor itself is cheap enough to run on the bit-serial compute array without stalling the pipeline. There is no static precision schedule and no one-size-fits-all truncation.
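The shape of such a per-cluster decision can be sketched in a few lines of Python. This is a rule-based stand-in, not the paper's predictor: the thresholds, the set of precision levels, the reading of "scale" as the number of vectors in a cluster, and the names `ClusterFeatures` and `choose_bits` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterFeatures:
    size: int          # vectors in the cluster ("scale"; this reading is assumed)
    radius: float      # largest centroid-to-member distance
    query_dist: float  # distance from the query to the centroid


def choose_bits(c, levels=(4, 8, 16, 32)):
    """Hypothetical rule: the closer the query is to the cluster, the more bits."""
    # Gap between the query and the cluster boundary, normalized by the radius.
    gap = max(c.query_dist - c.radius, 0.0) / (c.radius + 1e-12)
    if gap < 0.5:
        idx = 3
    elif gap < 1.0:
        idx = 2
    elif gap < 2.0:
        idx = 1
    else:
        idx = 0
    # Assumed heuristic: large clusters hold more potential top-k candidates,
    # so they get one extra precision level when one is available.
    if c.size > 10_000 and idx < len(levels) - 1:
        idx += 1
    return levels[idx]


print(choose_bits(ClusterFeatures(size=100, radius=1.0, query_dist=1.2)))
print(choose_bits(ClusterFeatures(size=100, radius=1.0, query_dist=10.0)))
```

A real predictor would presumably be fitted to data rather than hand-tuned; the sketch only shows that the decision depends on a handful of cheap scalar features per cluster.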

Bit-Serial Engine and Greedy Scheduling

To execute variable-precision distance calculations efficiently, the team built a bit-serial accelerator with a bit-interleaved data layout. Because a bit-serial unit processes one bit per cycle, throughput scales inversely with bit width: a 4-bit comparison takes one-eighth the cycles of a 32-bit one. The real challenge is load imbalance, since clusters running at different bit widths can leave compute units idle. ANNS-AMP's greedy scheduling strategy assigns work to processing elements in a way that keeps all lanes busy, mitigating memory bandwidth stalls. The architecture reuses the same bit-serial array for the predictor itself, avoiding dedicated hardware for classification.
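The load-balancing problem can be illustrated with a short Python sketch. The article does not spell out the scheduling rule, so this uses a standard greedy heuristic (longest job first, assigned to the least-loaded processing element) as a stand-in. The cost model in `bit_serial_cycles`, the function names, and the cluster sizes are assumptions.

```python
import heapq


def bit_serial_cycles(num_vectors, dims, bits):
    """Assumed cost model: one cycle per bit, per dimension, per vector."""
    return num_vectors * dims * bits


def greedy_schedule(jobs, num_pes):
    """Assign (cluster_id, cycles) jobs to processing elements (PEs).

    Longest job first, each placed on the currently least-loaded PE.
    Returns (assignment, loads).
    """
    heap = [(0, pe) for pe in range(num_pes)]  # (current load, PE index)
    heapq.heapify(heap)
    assignment = {pe: [] for pe in range(num_pes)}
    for cluster_id, cycles in sorted(jobs, key=lambda j: j[1], reverse=True):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(cluster_id)
        heapq.heappush(heap, (load + cycles, pe))
    loads = [0] * num_pes
    for load, pe in heap:
        loads[pe] = load
    return assignment, loads


# Hypothetical clusters: (id, number of vectors, chosen bit width).
clusters = [("a", 1000, 32), ("b", 1000, 4), ("c", 2000, 8),
            ("d", 500, 16), ("e", 3000, 4)]
jobs = [(cid, bit_serial_cycles(n, 128, bits)) for cid, n, bits in clusters]
assignment, loads = greedy_schedule(jobs, num_pes=2)
print(assignment, loads)
```

With these made-up numbers the two processing elements end up with identical loads of 4,608,000 cycles each, even though the clusters differ in both size and bit width. Real workloads will not balance this neatly, but the heuristic keeps the gap small.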

Speedups That Scale and Energy That Vanishes

On standard ANNS benchmarks (SIFT1M, GIST, DEEP, etc.), ANNS-AMP achieves an average 163.76x speedup over a CPU baseline, 10.57x over a GPU implementation, and 2.06x over a prior custom ANNS accelerator. Energy consumption drops by an average of 1100x, 39.41x, and 6.66x against the same three baselines, respectively. The CPU comparison is especially brutal because fixed-precision memory accesses dominate power draw. Accuracy loss stays below 2.7% across all evaluated recall targets. These aren't cherry-picked outliers; they're averages over multiple datasets and recall settings.

ANNS-AMP's adaptive precision scheme turns the old trade-off between speed and accuracy into a continuum controlled by a cheap hardware predictor. Expect this runtime-adaptive approach to migrate into other distance-intensive kernels like k-means clustering and k-NN classification, where the same cluster-precision insight applies.

Source: ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing
Domain: arxiv.org
