EnerInfer Cuts On-Device LLM Energy by 65% Without Slowing Token Generation

A new framework predicts the optimal NPU and memory frequency for each model and runtime condition, replacing brute-force profiling with a ranking-based feedback loop.

enerinfer, on device inference, energy efficiency, large language models, npu, memory frequency

Up to 65% less battery drain on phones, 12% on a laptop, and 24% on a development board, all while the user sees the same token generation speed. That's the claim EnerInfer makes for on-device LLM inference, and the paper backs it with measurements on real hardware.

The Problem: Speed-Optimal Is Not Energy-Optimal

Existing on-device LLM systems optimize solely for decoding speed, assuming faster is always better. That assumption leaves substantial energy savings and thermal headroom unused. EnerInfer's key insight: you can often lower NPU and memory frequencies modestly, staying well within quality-of-experience (QoE) constraints, while cutting energy and heat substantially.

The catch: the most efficient frequency pair depends on the model, the inference engine, the specific phone or laptop, even the current thermal state. No single configuration wins across all combinations, and commercial devices lack the component-level power sensors needed for direct measurement.
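The gap between speed-optimal and energy-optimal can be seen with simple arithmetic: energy per token is power divided by throughput. The sketch below is only an illustration of that relationship. The two configurations, their power and throughput figures, and the 10 tokens/s QoE floor are hypothetical values chosen for the example, not measurements from the paper.

```python
def energy_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Joules spent per generated token: energy = power / throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


# Hypothetical QoE floor: the slowest decoding speed a user would tolerate.
QOE_MIN_TOKENS_PER_S = 10.0

# Hypothetical operating points (illustrative numbers, not from the paper).
fastest = {"power_watts": 6.0, "tokens_per_second": 20.0}  # max NPU/DDR frequency
reduced = {"power_watts": 3.0, "tokens_per_second": 15.0}  # modestly lower frequency

e_fast = energy_per_token(**fastest)
e_reduced = energy_per_token(**reduced)

# Fraction of energy saved per token by running at the reduced frequency.
saving = 1.0 - e_reduced / e_fast

# The reduced setting is only acceptable if it still meets the QoE floor.
meets_qoe = reduced["tokens_per_second"] >= QOE_MIN_TOKENS_PER_S

print(f"fastest: {e_fast:.2f} J/token, reduced: {e_reduced:.2f} J/token")
print(f"saving: {saving:.0%}, meets QoE: {meets_qoe}")
```

In this made-up case the slower setting generates fewer tokens per second yet spends less energy on each one, which is the kind of trade-off the article describes.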

How EnerInfer Works: Prediction Over Profiling

EnerInfer replaces the usual per-model profiling with a two-part system. First, a model-structure-aware predictor estimates throughput and power for unseen LLMs across NPU/DDR frequency settings without ever running the model on those settings. Second, a ranking-driven online feedback loop picks the configuration that meets QoE targets while minimizing energy under actual runtime interference.
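One way to picture the two parts working together is sketched below. This is a minimal illustration, not the paper's algorithm: the `Config` type, the predicted values, and the `measure_tokens_per_s` callback are all hypothetical names and numbers. Candidates are ranked by predicted energy per token, and measured throughput then decides which ranked candidate is actually kept.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Config:
    """Hypothetical NPU/DDR frequency pair with predicted behavior."""
    npu_mhz: int
    ddr_mhz: int
    pred_tokens_per_s: float
    pred_power_w: float

    @property
    def pred_joules_per_token(self) -> float:
        return self.pred_power_w / self.pred_tokens_per_s


def rank_candidates(configs: Sequence[Config], qoe_tokens_per_s: float) -> List[Config]:
    """Keep configs predicted to meet QoE, cheapest energy per token first."""
    feasible = [c for c in configs if c.pred_tokens_per_s >= qoe_tokens_per_s]
    return sorted(feasible, key=lambda c: c.pred_joules_per_token)


def select_with_feedback(
    ranked: Sequence[Config],
    measure_tokens_per_s: Callable[[Config], float],
    qoe_tokens_per_s: float,
) -> Optional[Config]:
    """Walk the ranking and keep the first config whose measured speed meets QoE."""
    for cfg in ranked:
        if measure_tokens_per_s(cfg) >= qoe_tokens_per_s:
            return cfg
    return None  # nothing met the target under current runtime conditions


# Hypothetical predictions for four frequency pairs.
candidates = [
    Config(1000, 3200, pred_tokens_per_s=20.0, pred_power_w=6.0),
    Config(800, 2400, pred_tokens_per_s=15.0, pred_power_w=3.6),
    Config(600, 1600, pred_tokens_per_s=11.0, pred_power_w=2.2),
    Config(400, 1600, pred_tokens_per_s=8.0, pred_power_w=1.5),
]


def measured_under_interference(cfg: Config) -> float:
    """Stand-in for a real measurement: assume interference costs 20% of speed."""
    return cfg.pred_tokens_per_s * 0.8


ranked = rank_candidates(candidates, qoe_tokens_per_s=10.0)
chosen = select_with_feedback(ranked, measured_under_interference, qoe_tokens_per_s=10.0)
print(chosen)
```

Here the most frugal feasible prediction (600/1600 MHz) falls below the target once the simulated interference is applied, so the loop falls back to the next entry in the ranking.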

A lightweight limited-horizon thermal predictor tracks how shell temperature evolves with request arrivals and response lengths, and dynamically switches between energy-optimized and thermally constrained inference modes. No sensor-heavy hardware modifications are required.
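The mode switch can be sketched as a short look-ahead over predicted shell temperature. The article does not describe the predictor's internals, so the example below swaps in a simple first-order heating and cooling model as an assumption. The function names, the per-step heating term (standing in for the load implied by pending requests and response lengths), and all numbers are hypothetical.

```python
from typing import List


def predict_shell_temps(
    temp_c: float,
    ambient_c: float,
    heat_per_step_c: float,
    cooling_rate: float,
    horizon: int,
) -> List[float]:
    """Roll an assumed first-order thermal model forward for `horizon` steps.

    Each step adds heat from the expected workload and loses heat in
    proportion to the gap between shell and ambient temperature.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    temps = []
    for _ in range(horizon):
        temp_c = temp_c + heat_per_step_c - cooling_rate * (temp_c - ambient_c)
        temps.append(temp_c)
    return temps


def choose_mode(
    temp_c: float,
    ambient_c: float,
    heat_per_step_c: float,
    cooling_rate: float,
    horizon: int,
    limit_c: float,
) -> str:
    """Pick the thermally constrained mode if the limit is predicted to be crossed."""
    predicted = predict_shell_temps(temp_c, ambient_c, heat_per_step_c, cooling_rate, horizon)
    if max(predicted) > limit_c:
        return "thermally_constrained"
    return "energy_optimized"


# Hypothetical scenario: same workload, two different starting temperatures.
cool_start = choose_mode(38.0, 25.0, 1.0, 0.05, horizon=5, limit_c=42.0)
warm_start = choose_mode(41.5, 25.0, 1.0, 0.05, horizon=5, limit_c=42.0)
print(cool_start, warm_start)
```

Looking only a few steps ahead keeps the check cheap, and it lets the system react before the limit is reached rather than after.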

The Numbers That Matter

The team ran real LLMs on phones, a laptop, and a development board. Energy efficiency gains reached up to 65% on phones, 12% on the laptop, and 24% on the development board. Crucially, QoE was never violated: token generation latency stayed within user-tolerable bounds.

For engineers shipping on-device AI, this is the kind of systems work that turns a promising capability into a shippable product. EnerInfer makes the case that you don't need exotic sensing or exhaustive profiling to run LLMs sustainably on the devices users actually carry.


Source: EnerInfer: Energy-Aware On-Device LLM Inference
Domain: arxiv.org
