ANNS-AMP Cuts Nearest Neighbor Search Energy by 1100x with Adaptive Precision

Q: What is the significance of ANNS-AMP cutting nearest neighbor search energy by 1100x with adaptive precision?

Adaptive mixed-precision computing for approximate nearest neighbor search achieves a 163x speedup over a CPU and an 1100x energy reduction while keeping accuracy loss under 2.7%.

1100x energy reduction on nearest neighbor search isn't a typo — it's what ANNS-AMP delivers by adapting arithmetic precision on the fly. The framework targets the dominant bottleneck in modern LLM and recommendation pipelines: computing distances between a query and millions of high-dimensional vectors, most of which are irrelevant. Traditional accelerators burn fixed-precision cycles on every comparison. ANNS-AMP instead asks which vectors deserve full 32-bit attention and which can be graded with a handful of bits.

How ANNS-AMP Chooses Precision Per Cluster

The key structural insight: vector space isn't uniform. Clusters closer to the query in PQ (product quantization) space need finer resolution to preserve top-k ordering; far-away clusters can tolerate coarser arithmetic. ANNS-AMP introduces a lightweight runtime predictor that examines per-cluster features — scale, radius, and query distance — to decide a precision level at inference time. The predictor itself is cheap enough to run on the bit-serial compute array without stalling the pipeline. No static precision schedule, no one-size-fits-all truncation.
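To make the idea concrete, here is a minimal Python sketch of per-cluster precision selection. It is not the paper's predictor: `choose_bits`, the 32/16/8/4 bitwidth menu, and the ratio thresholds are all hypothetical, and the sketch uses only two of the three features named above (radius and query distance), leaving out scale.

```python
import math

def choose_bits(query, centroid, radius, nearest_dist):
    """Pick a bitwidth for one cluster (hypothetical rule, not the paper's predictor)."""
    # Lower bound on the distance from the query to any vector in the cluster.
    lower_bound = max(math.dist(query, centroid) - radius, 0.0)
    # The cluster may reach as close as the nearest centroid: keep full precision.
    if lower_bound <= nearest_dist:
        return 32
    # How many times farther than the nearest centroid the cluster sits.
    ratio = lower_bound / nearest_dist if nearest_dist > 0 else math.inf
    if ratio < 2:
        return 16
    if ratio < 4:
        return 8
    return 4

# Toy 2-D clusters: name -> (centroid, radius). Values are made up for illustration.
query = (0.0, 0.0)
clusters = {
    "near": ((1.0, 0.0), 0.5),
    "mid": ((3.0, 0.0), 0.5),
    "far": ((10.0, 0.0), 1.0),
}
nearest = min(math.dist(query, c) for c, _ in clusters.values())
bits = {name: choose_bits(query, c, r, nearest) for name, (c, r) in clusters.items()}
print(bits)  # {'near': 32, 'mid': 8, 'far': 4}
```

The point of the sketch is only the shape of the decision: a few cheap per-cluster features go in and a bitwidth comes out, so the choice can be made at query time rather than fixed in advance.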

Bit-Serial Engine and Greedy Scheduling

To execute variable-precision distance calculations efficiently, the team built a bit-serial accelerator with a bit-interleaved data layout. Throughput scales linearly with reduced precision: a 4-bit comparison completes eight times faster than a 32-bit one. The real challenge is load imbalance — different clusters running at different bitwidths can leave compute units idle. ANNS-AMP's greedy scheduling strategy assigns work to processing elements in a way that keeps all lanes busy, mitigating memory bandwidth stalls. The architecture reuses the same bit-serial array for the predictor itself, avoiding dedicated hardware for classification.
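The load-balancing problem can be illustrated with a short Python sketch. It assumes a simple cost model in which bit-serial cycles grow with vectors times bits, and it uses the classic longest-job-first greedy heuristic; the paper's actual scheduler may differ, and `cycles`, `greedy_schedule`, and the cluster sizes are hypothetical.

```python
import heapq

def cycles(num_vectors, bits):
    """Assumed cost model: bit-serial work is proportional to vectors times bits."""
    return num_vectors * bits

def greedy_schedule(clusters, num_pes):
    """Assign clusters to processing elements, largest job first.

    clusters: list of (name, num_vectors, bits).
    Returns (assignment, loads), both keyed by PE index.
    """
    jobs = sorted(((cycles(n, b), name) for name, n, b in clusters), reverse=True)
    heap = [(0, pe) for pe in range(num_pes)]  # (current load, PE index)
    assignment = {pe: [] for pe in range(num_pes)}
    for cost, name in jobs:
        load, pe = heapq.heappop(heap)  # least-loaded PE so far
        assignment[pe].append(name)
        heapq.heappush(heap, (load + cost, pe))
    loads = {pe: load for load, pe in heap}
    return assignment, loads

# Under this model a 32-bit comparison costs 8x a 4-bit one.
print(cycles(1, 32) // cycles(1, 4))  # 8

# Made-up clusters running at different bitwidths.
clusters = [("a", 100, 32), ("b", 800, 4), ("c", 200, 8), ("d", 100, 16)]
assignment, loads = greedy_schedule(clusters, num_pes=2)
print(assignment)  # {0: ['b', 'd'], 1: ['a', 'c']}
print(loads)       # {0: 4800, 1: 4800}
```

In this toy case a small 32-bit cluster and a large 4-bit cluster cost the same number of cycles, which is why the scheduler has to weigh bitwidth and cluster size together rather than count clusters per lane.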

Speedups That Scale and Energy That Vanishes

On standard ANNS benchmarks (SIFT1M, GIST, DEEP, etc.), ANNS-AMP achieves an average 163.76x speedup over a CPU baseline, 10.57x over a GPU implementation, and 2.06x over a prior custom ANNS accelerator. Energy consumption drops by an average of 1100x, 39.41x, and 6.66x against the same three baselines, respectively. The CPU comparison is especially brutal because fixed-precision memory accesses dominate power draw. Accuracy loss stays below 2.7% across all evaluated recall targets. These aren't cherry-picked outliers; they're averages over multiple datasets and recall settings.

ANNS-AMP's adaptive precision scheme turns the old trade-off between speed and accuracy into a continuum controlled by a cheap hardware predictor. Expect this runtime-adaptive approach to migrate into other distance-intensive kernels like k-means clustering and k-NN classification, where the same cluster-precision insight applies.

Source: ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing
Domain: arxiv.org
