
PALUTE Runs LLMs at 1,264 TPS on 0.16 W Using In-DRAM Lookups

At 0.16 watts, PALUTE processes 1,264 tokens per second on a Qwen3-4B model by performing lookup-table queries directly inside DRAM, beating prior accelerators by up to 12.8x in energy efficiency.

Tags: PALUTE, processing in memory, large language models, edge AI, DRAM, LUT

1,264 tokens per second at just 0.16 watts: that's what PALUTE delivers for a Qwen3-4B model, and it does it by turning DRAM into a lookup table engine.

The LUT Problem and PALUTE's In-DRAM Fix

Quantized LLMs still burn power on dequantization and nonlinear operators. Lookup tables (LUTs) replace repeated arithmetic with memory reads, but prior designs choked on capacity and lookup latency. PALUTE, built on Monolithic 3D (M3D) DRAM, sidesteps both by executing LUT queries inside the DRAM memory array tiles themselves. The vertical organization of M3D DRAM gives high parallelism without the area overhead of separate table storage.
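To make "memory reads instead of arithmetic" concrete, here is a minimal software sketch of a LUT-driven matrix-vector product for 4-bit weights and 4-bit activations. It illustrates the general idea only and is not PALUTE's in-DRAM design; the signed code range [-8, 7], the 16 x 16 product table, and the names `PRODUCT_LUT`, `lut_dot`, and `lut_gemv` are all assumptions made for this example.

```python
# Hypothetical W4A4 setup: weights and activations are signed 4-bit codes in [-8, 7].
CODES = range(-8, 8)

# Precompute all 16 x 16 products once; PRODUCT_LUT[w + 8][a + 8] == w * a.
PRODUCT_LUT = [[w * a for a in CODES] for w in CODES]


def lut_dot(w_codes, a_codes):
    """Dot product of two 4-bit code vectors using table reads, no multiplies."""
    return sum(PRODUCT_LUT[w + 8][a + 8] for w, a in zip(w_codes, a_codes))


def lut_gemv(w_rows, a_codes):
    """Matrix-vector product: one table-driven dot product per weight row."""
    return [lut_dot(row, a_codes) for row in w_rows]


W = [[-8, 7, 3, 0], [1, -1, 2, -2]]
a = [2, -3, 7, 5]
print(lut_gemv(W, a))  # prints [-16, 9]
```

The table here has only 256 entries because both operands are 4 bits wide. Tables that cover groups of weights or activations grow much faster, which is one way to read the capacity problem described above.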

A near-memory LUT generator builds the tables for both GEMM and element-wise unary nonlinear operators, keeping generation latency low. A system-level tiering and scheduling strategy then minimizes data movement across memory tiers, which matters when the whole design runs at 0.16 W.
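The same trick extends to element-wise nonlinear operators: with 4-bit inputs there are only 16 possible codes, so the function can be tabulated once and then applied by indexing. The sketch below assumes SiLU as the operator and a dequantization step of 0.5; neither is taken from the source, and `build_unary_lut` and `apply_lut` are hypothetical helper names.

```python
import math


def silu(x):
    """SiLU activation, x * sigmoid(x); chosen here only as an example operator."""
    return x / (1.0 + math.exp(-x))


def build_unary_lut(fn, scale, bits=4):
    """Tabulate fn over every signed `bits`-wide input code (hypothetical helper)."""
    lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    return [fn(code * scale) for code in range(lo, hi)]


def apply_lut(lut, codes, bits=4):
    """Apply the tabulated function element-wise by indexing, with no math calls."""
    offset = 1 << (bits - 1)
    return [lut[c + offset] for c in codes]


SCALE = 0.5  # assumed dequantization step, not from the source
SILU_LUT = build_unary_lut(silu, SCALE)
out = apply_lut(SILU_LUT, [-8, 0, 7])
```

Generating the table costs 16 function evaluations, after which every activation is a single read. A real generator would also have to choose output precision, which this sketch leaves as floating point.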

How PALUTE Achieves 1,264 TPS at 0.16W

The numbers come from cycle-accurate simulation and RTL synthesis. On a Qwen3-4B model with W4A4 quantization (4-bit weights, 4-bit activations), PALUTE reaches 1,264 TPS of end-to-end throughput, a figure that covers the full inference pipeline rather than a single kernel.

The secret is in-DRAM LUT access. Instead of shuttling data between compute logic and memory, PALUTE keeps the lookup operation entirely within the DRAM tile. The result: energy efficiency of 12.8x over CHIME and 1.6x over FIGLUT, two prior state-of-the-art edge LLM accelerators.
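The headline figures translate directly into energy per token. The short calculation below assumes that 0.16 W is the average power drawn while sustaining 1,264 TPS, which the article implies but does not state outright; the variable names are illustrative.

```python
POWER_W = 0.16          # reported power
THROUGHPUT_TPS = 1264   # reported end-to-end tokens per second

# Energy = power / rate; assumes the power figure applies at this throughput.
energy_per_token_j = POWER_W / THROUGHPUT_TPS
tokens_per_joule = THROUGHPUT_TPS / POWER_W

print(f"{energy_per_token_j * 1e6:.1f} microjoules per token")  # 126.6
print(f"{tokens_per_joule:.0f} tokens per joule")               # 7900
```

Under that assumption, each generated token costs roughly 127 microjoules, or about 7,900 tokens per joule.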

Doubling Area Efficiency Over PIMPAL

Area efficiency matters just as much for edge silicon. PALUTE achieves 2.0x better area efficiency than PIMPAL under the same W4A4 configuration. That means more performance per square millimeter on a die budget that barely has room for a scratchpad.

For edge deployments that need a 4B-parameter model at sub-watt power, PALUTE's M3D DRAM approach sets a new baseline and shows that LUT capacity and lookup latency need not be the bottleneck.


Source: PALUTE: Processing-In-Memory Acceleration via Lookup Table for Edge LLM Inference
Domain: arxiv.org
