
Edge Prompt Compression Trims 99.9 Tokens per Call Without Sacrificing LLM Quality

A new distillation pipeline called SPSD compresses the social filler in user requests on the device, saving 99.9 tokens per call and cutting energy use by 70 to 270 µWh while matching the response quality of the raw prompt.

Tags: SPSD, LLM inference, edge computing, prompt compression, Gemma-2-2B-Instruct, Llama-3.1-8B-Instruct

Stripping conversational fluff like "I hope you're doing well" from LLM prompts saves a mean of 99.9 input tokens per call without hurting response quality, according to the paper "Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference."

The Social-Semantic Gap Costs Real Energy

Every chatbot prompt that opens with a polite preamble, apologetic framing, or repetition adds tokens that carry near-zero marginal information for machine reasoning. The authors call this the Social-Semantic Gap, and it is a real line item on cloud inference bills: for consumer-facing LLM apps, the prefill stage, which processes every input token before generation starts, consumes significant energy. SPSD (Sentiment Preserving Semantic Distillation) removes that waste at the edge, before the prompt is transmitted.

How SPSD Works and What It Achieves

SPSD runs a 4-bit quantised Small Language Model (Gemma-2-2B-Instruct with Q4_K_M quantisation) on the user's device to distill the prompt, then sends the compressed version to a cloud-hosted Llama-3.1-8B-Instruct. In a 248-prompt corpus, all 146 distilled calls produced positive savings, with a mean of 99.9 input tokens saved per call. Fewer input tokens mean less prefill work, which matters for both latency and cost.
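The edge-side flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `toy_distill` is a hypothetical regex-based stand-in for the on-device Gemma model, `count_tokens` is a crude whitespace proxy for a real tokenizer, and the filler patterns are invented for the example. The one design point it shows is the fallback: the compressed prompt is sent only when it actually saves tokens.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical filler patterns. The paper uses an on-device SLM for this
# step; regexes are only a stand-in so the sketch runs without a model.
FILLER_PATTERNS = [
    r"\bi hope you(?:'re| are) doing well[.!,]?",
    r"\bsorry to bother you[.!,]?",
    r"\bthanks? (?:so much )?in advance[.!,]?",
]


def toy_distill(prompt: str) -> str:
    """Remove known social filler and collapse leftover whitespace."""
    out = prompt
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()


def count_tokens(text: str) -> int:
    """Whitespace word count, an assumed proxy for a real tokenizer."""
    return len(text.split())


@dataclass
class EdgeResult:
    sent_prompt: str   # what would be transmitted to the cloud model
    tokens_saved: int  # input tokens saved relative to the raw prompt
    distilled: bool    # False when the raw prompt is passed through


def edge_compress(prompt: str,
                  distill: Callable[[str], str] = toy_distill) -> EdgeResult:
    """Distill on the device; fall back to the raw prompt if nothing is saved."""
    candidate = distill(prompt)
    saved = count_tokens(prompt) - count_tokens(candidate)
    if candidate and saved > 0:
        return EdgeResult(candidate, saved, True)
    return EdgeResult(prompt, 0, False)


raw = ("Hi! I hope you're doing well. Sorry to bother you, "
       "but how do I reverse a list in Python? Thanks in advance!")
result = edge_compress(raw)
print(result.sent_prompt, "| tokens saved:", result.tokens_saved)
```

In a real deployment the `distill` argument would wrap a call to the local quantised model, and `sent_prompt` would be forwarded to the cloud endpoint.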

Response quality was measured by a blind LLM-as-judge on a 15-point rubric. The judge awarded 43 percent ties, 28 percent wins for the distilled path, and 29 percent wins for the raw path. Within a pre-specified non-inferiority margin of 1 point, SPSD is statistically no worse than the full prompt. Cosine similarity has a median of 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. That result is mixed, but acceptable for non-critical applications.
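To make the non-inferiority claim concrete, here is one way such a check could be run on paired judge scores. The article does not say which statistical procedure the paper used, so the one-sided normal-approximation bound below is an assumption, and the scores are made-up illustrative values, not data from the study.

```python
import math
from statistics import mean, stdev


def tally(distilled, raw):
    """Count wins, ties, and losses for the distilled path."""
    wins = sum(d > r for d, r in zip(distilled, raw))
    ties = sum(d == r for d, r in zip(distilled, raw))
    losses = sum(d < r for d, r in zip(distilled, raw))
    return wins, ties, losses


def non_inferiority(distilled, raw, margin=1.0, z=1.645):
    """One-sided check that the mean paired difference stays above -margin.

    z=1.645 gives an approximate 95% one-sided lower bound under a normal
    approximation (an assumed method, not necessarily the paper's).
    """
    diffs = [d - r for d, r in zip(distilled, raw)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    lower_bound = mean_diff - z * std_err
    return {
        "mean_diff": mean_diff,
        "lower_bound": lower_bound,
        "non_inferior": lower_bound > -margin,
    }


# Made-up rubric scores out of 15, for illustration only.
distilled_scores = [12, 13, 11, 14, 12, 13]
raw_scores = [12, 12, 12, 14, 13, 13]

print(tally(distilled_scores, raw_scores))
print(non_inferiority(distilled_scores, raw_scores))
```

The distilled path is declared non-inferior when the lower bound of the mean score difference stays above the negative margin, even if the raw path wins slightly more head-to-head comparisons.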

Energy and Safety

Per-call net energy saving is estimated at 70 to 270 µWh, depending on prompt complexity and device model. Those numbers add up across millions of interactions. Safety-critical domains are conservatively routed to passthrough by rule-based gates, so sensitive requests reach the cloud model uncompressed and cannot pick up compression-induced errors.
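A rule-based gate and the fleet-level energy arithmetic might look like the sketch below. The keyword patterns are hypothetical, since the article does not list the paper's actual rules, and `fleet_saving_wh` simply scales the reported per-call range by a call count.

```python
import re

# Hypothetical safety keywords; the paper's real gate rules are not given here.
SAFETY_PATTERNS = [
    r"\b(dosage|overdose|medication|symptom)s?\b",
    r"\b(suicid\w*|self[- ]harm)\b",
    r"\b(lawsuit|legal advice)\b",
]


def route(prompt: str) -> str:
    """Return 'passthrough' for safety-critical prompts, else 'distill'."""
    for pattern in SAFETY_PATTERNS:
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            return "passthrough"
    return "distill"


def fleet_saving_wh(calls: int, per_call_uwh: float) -> float:
    """Scale a per-call saving in microwatt-hours to watt-hours."""
    return calls * per_call_uwh / 1_000_000


print(route("What dosage of ibuprofen is safe for a child?"))
print(route("Hi, how do I sort a list in Python?"))

# The reported 70 to 270 µWh range, scaled to one million calls.
low = fleet_saving_wh(1_000_000, 70)
high = fleet_saving_wh(1_000_000, 270)
print(f"{low:.0f} to {high:.0f} Wh per million calls")
```

At one million calls, the reported range works out to 70 to 270 Wh. A conservative gate errs toward passthrough: a false positive only forfeits a small saving, while a false negative risks altering a sensitive request.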

A practical edge distillation pipeline that cuts token cost by ~100 tokens per call while preserving response quality is ready for real-world deployment. The next question is whether device-side SLMs can run on today's mobile hardware without noticeable battery drain.


Source: Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference
Domain: arxiv.org
