
Snapdragon NPU Runs RAG 18x Faster, Uses 12x Less Energy Than CPU

Running a full RAG pipeline (embedding, reranking, and LLM generation) on the Snapdragon X Elite's Qualcomm Hexagon NPU delivers 18.1x faster LLM prefilling than the CPU, and 86.7% of queries receive identical answer-quality scores across the NPU, CPU, and GPU backends.

qualcomm, snapdragon x elite, hexagon npu, rag, on device ai, energy efficiency

The first end-to-end RAG pipeline to run every neural stage (embedding, reranking, and LLM generation) on a mobile NPU has been benchmarked on the Snapdragon X Elite's Hexagon NPU, and the numbers are stark: 18.1x faster LLM prefilling than the CPU, and 4.0x lower end-to-end query latency.

18x Faster Prefill, 4x Lower Latency: The NPU Baseline

Profiling on a Dell XPS 13 laptop, the team compared NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines. During indexing, the NPU achieved 9.1x higher embedding throughput and used 12.3x less system energy than the CPU. On a 120-query Wikipedia-passage benchmark, LLM prefilling ran 18.1x faster than on the CPU. The integrated GPU was 1.7x slower than the CPU and consumed 6.5x more energy than the NPU, which leaves the NPU as the only practical option among the three backends for sustained on-device RAG.
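For readers who want to reproduce this kind of comparison on their own hardware, here is a minimal sketch of how "Nx faster" and "Nx less energy" figures are typically derived, assuming they are plain ratios of baseline to candidate measurements. The helper name `ratio` and all input numbers are hypothetical placeholders, not measurements from the paper.

```python
def ratio(baseline: float, candidate: float) -> float:
    """Return how many times smaller `candidate` is than `baseline`.

    Works for any "lower is better" quantity such as latency (seconds)
    or energy (joules). A result below 1.0 means the candidate is worse.
    """
    if candidate <= 0:
        raise ValueError("candidate measurement must be positive")
    return baseline / candidate


# Hypothetical placeholder measurements (NOT figures from the paper).
cpu_prefill_s = 9.0    # CPU prefill time, seconds
npu_prefill_s = 0.5    # NPU prefill time, seconds
cpu_energy_j = 120.0   # CPU system energy, joules
npu_energy_j = 10.0    # NPU system energy, joules

prefill_speedup = ratio(cpu_prefill_s, npu_prefill_s)
energy_reduction = ratio(cpu_energy_j, npu_energy_j)

print(f"prefill speedup: {prefill_speedup:.1f}x")
print(f"energy reduction: {energy_reduction:.1f}x")
```

Under the same convention, a backend reported as "1.7x slower" would yield a ratio below 1.0 when passed as the candidate.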

GPT-4.1 Says Answers Are Indistinguishable

A GPT-4.1 LLM-as-judge evaluation scored answer quality on a 1-10 rubric. The NPU scored 9.32, the CPU 8.95, and the GPU 9.03, all within evaluator noise, and 86.7% of queries scored identically across all three backends. Moving the neural stages off the CPU and onto a purpose-built accelerator caused no measurable quality regression.
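The two summary statistics quoted here, a mean score per backend and the share of queries scored identically on every backend, can be computed as in the sketch below. It assumes one integer judge score per query per backend; the function name `summarize` and the sample scores are hypothetical and only illustrate the arithmetic, not the paper's evaluation code.

```python
from statistics import mean


def summarize(scores: dict[str, list[int]]) -> tuple[dict[str, float], float]:
    """Return (mean score per backend, fraction of queries scored identically).

    `scores` maps a backend name to its per-query judge scores; all lists
    must be aligned so that index i refers to the same query everywhere.
    """
    backends = list(scores)
    n = len(scores[backends[0]])
    if n == 0 or any(len(scores[b]) != n for b in backends):
        raise ValueError("every backend needs the same non-zero number of scores")

    means = {b: mean(scores[b]) for b in backends}
    # A query counts as identical when all backends received the same score.
    identical = sum(
        1 for i in range(n) if len({scores[b][i] for b in backends}) == 1
    )
    return means, identical / n


# Hypothetical placeholder scores for four queries (NOT data from the paper).
scores = {
    "npu": [9, 10, 8, 9],
    "cpu": [9, 10, 7, 9],
    "gpu": [9, 10, 8, 9],
}
means, identical_fraction = summarize(scores)
print(means)
print(f"identical across backends: {identical_fraction:.1%}")
```

Reporting the identical-score fraction alongside the means is useful because close averages alone can hide per-query disagreements that cancel out.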

Why This Changes the On-Device AI Calculus

Running RAG entirely on-device has always hit the wall of CPU energy draw. The Hexagon NPU breaks that barrier, enabling private, offline, low-latency retrieval-augmented generation without burning through battery. The paper suggests that comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) should see similar gains as their software stacks mature.

Expect on-device RAG to become as routine as local photo processing — and just as efficient.


Source: Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
Domain: arxiv.org
