The first end-to-end RAG pipeline to run every neural stage (embedding, reranking, and LLM generation) on a mobile NPU has been benchmarked on the Snapdragon X Elite's Hexagon NPU, and the numbers are stark: 18.1x faster LLM prefilling than the CPU and 4.0x lower end-to-end query latency.
18x Faster Prefill, 4x Less Energy: The NPU Baseline
Profiling on a Dell XPS 13 laptop, the authors compared NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines. On indexing, the NPU achieved 9.1x higher embedding throughput and used 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, LLM prefilling ran 18.1x faster than on the CPU. The integrated GPU was actually 1.7x slower than the CPU and consumed 6.5x more energy than the NPU, which leaves the NPU as the only viable path for sustained on-device RAG.
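Ratios like these come from simple arithmetic on per-backend measurements. Here is a minimal sketch, assuming energy is average system power multiplied by wall-clock time. The `Measurement` type and the numbers are hypothetical placeholders for illustration, not figures or tooling from the paper.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One backend's result for a fixed workload (hypothetical structure)."""
    seconds: float    # wall-clock time to finish the workload
    avg_watts: float  # average system power draw during the run

    @property
    def joules(self) -> float:
        # Energy = average power x elapsed time
        return self.avg_watts * self.seconds


def speedup(baseline: Measurement, candidate: Measurement) -> float:
    """How many times faster the candidate finishes than the baseline."""
    return baseline.seconds / candidate.seconds


def energy_ratio(baseline: Measurement, candidate: Measurement) -> float:
    """How many times less energy the candidate uses than the baseline."""
    return baseline.joules / candidate.joules


# Placeholder numbers, NOT taken from the paper.
cpu = Measurement(seconds=20.0, avg_watts=30.0)
npu = Measurement(seconds=5.0, avg_watts=10.0)

print(f"speedup: {speedup(cpu, npu):.1f}x")
print(f"energy:  {energy_ratio(cpu, npu):.1f}x less")
```

Because energy depends on both time and power, a backend can be slower and still cheaper in energy, or faster and more expensive, so the two ratios are worth reporting separately.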
GPT-4.1 Says Answers Are Indistinguishable
A GPT-4.1 LLM-as-judge evaluation scored answer quality on a 1-10 rubric. The NPU scored 9.32, the CPU 8.95, and the GPU 9.03, all within evaluator noise, and 86.7% of queries received identical scores across all three backends. In other words, moving the pipeline's neural compute off the CPU and onto a purpose-built accelerator caused no quality regression.
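The two summary statistics here, a mean score per backend and the share of queries scored identically on every backend, can be computed from per-query judge scores. The following is a hedged sketch: the `summarize` helper and the toy scores are assumptions made for illustration and do not reproduce the paper's evaluation code or data.

```python
def summarize(scores: dict[str, list[int]]) -> tuple[dict[str, float], float]:
    """Return (mean score per backend, share of queries scored identically).

    `scores` maps a backend name to its per-query judge scores, with all
    lists aligned by query index.
    """
    backends = list(scores)
    n = len(scores[backends[0]])
    if n == 0 or any(len(scores[b]) != n for b in backends):
        raise ValueError("every backend needs the same non-zero number of scores")

    means = {b: sum(scores[b]) / n for b in backends}

    # A query counts as "identical" when all backends got the same score.
    identical = sum(
        1
        for per_query in zip(*(scores[b] for b in backends))
        if len(set(per_query)) == 1
    )
    return means, identical / n


# Toy scores for four queries, NOT data from the paper.
toy = {
    "npu": [9, 10, 8, 9],
    "cpu": [9, 10, 7, 9],
    "gpu": [9, 10, 8, 9],
}
means, share = summarize(toy)
print(means, f"{share:.1%} identical")
```

A high identical-score share alongside small differences in the means is what one would expect if the backends differ only in numerical precision rather than in model behavior.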
Why This Changes the On-Device AI Calculus
Running RAG entirely on-device has always hit the wall of CPU energy draw. The Hexagon NPU breaks that barrier, enabling private, offline, low-latency retrieval-augmented generation without burning through battery. The paper suggests that comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) should see similar gains as their software stacks mature.
Expect on-device RAG to become as routine as local photo processing — and just as efficient.
Source: Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
Domain: arxiv.org