
MxGLUT Cuts Multiplier Area 57% by Unifying FP8-INT4 and FP8-FP8 GEMM on LUTs

MxGLUT replaces dedicated FP multipliers with a single LUT-based compute mechanism, achieving 0.492 TFLOPS/mm² area efficiency and up to 2.16× prefill latency speedup across Llama models with at most a 1.70% perplexity increase.


MxGLUT slashes multiplier area by 56.92% and power by up to 78% by replacing dedicated FP multipliers with a unified LUT-based compute mechanism that handles both FP8-INT4 and FP8-FP8 GEMM.

LLM inference under weight-only quantization demands mixed-precision GEMM: activations stay in FP8 while weights are compressed to low-bit integers. Existing LUT accelerators like FIGLUT still bolt on separate FP datapaths for attention GEMM, wasting silicon and complicating execution. MxGLUT (arXiv:2607.01607) removes those extra datapaths entirely.
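To make the FP8-INT4 pairing concrete, here is a minimal sketch of a weight-only-quantized dot product. Python floats stand in for FP8 activations, and the symmetric per-group scale along with the names `quantize_int4` and `mixed_dot` are illustrative assumptions, not details taken from the MxGLUT paper.

```python
def quantize_int4(weights):
    """Symmetric per-group quantization to signed 4-bit integers in [-8, 7].

    Assumption: one shared scale per weight group; real schemes vary.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def mixed_dot(acts, q_weights, scale):
    """Dot product of float activations with INT4 weights, rescaled once at the end."""
    return scale * sum(a * w for a, w in zip(acts, q_weights))


weights = [0.7, -0.3, 0.1, 0.0]   # hypothetical full-precision weights
acts = [1.0, 0.5, 2.0, 4.0]       # floats standing in for FP8 activations

q_weights, scale = quantize_int4(weights)
result = mixed_dot(acts, q_weights, scale)
reference = sum(a * w for a, w in zip(acts, weights))
```

The integer weights are what gets stored and moved; the scale is applied once per group, so the inner loop only ever multiplies an activation by a small integer.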

LUTs Replace Dedicated FP Multipliers Across Mixed-Precision GEMM

MxGLUT's core is the MxLPE, a mixed-precision LUT-based processing element. Guided by a unified LUT execution framework, each MxLPE can compute both FP8-INT4 and FP8-FP8 GEMMs without a single dedicated FP multiplier. Synthesized in UMC 28nm CMOS at 200 MHz, this cuts multiplier area by 56.92% and power by 77.07% in FP8-INT4 mode, and power by 78.35% in FP8-FP8 mode, versus a conventional design.

Adding native FP8-FP8 support costs only a 2.57% area overhead and a 3.34% reduction in energy efficiency relative to the FP8-INT4-only FIGLUT baseline. That's a small price for full mixed-precision flexibility.
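For readers unfamiliar with LUT-based GEMM, the general idea can be sketched as follows. This is a generic bit-plane lookup scheme for signed INT4 weights, not the MxLPE design itself; the summary above does not describe the MxLPE internals at this level, and the names `build_lut` and `lut_dot_int4` are hypothetical. Floats again stand in for FP8 activations.

```python
def build_lut(acts):
    """Precompute every subset sum of a small activation group (2**n entries)."""
    n = len(acts)
    lut = [0.0] * (1 << n)
    for pattern in range(1, 1 << n):
        low = pattern & -pattern              # lowest set bit of the pattern
        idx = low.bit_length() - 1            # which activation that bit selects
        lut[pattern] = lut[pattern ^ low] + acts[idx]
    return lut


def lut_dot_int4(lut, weights):
    """Dot product with signed INT4 weights using one table lookup per bit plane."""
    assert all(-8 <= w <= 7 for w in weights)
    total = 0.0
    for plane in range(4):
        # Gather bit `plane` of every weight (two's complement) into a table index.
        pattern = 0
        for i, w in enumerate(weights):
            pattern |= (((w & 0xF) >> plane) & 1) << i
        # The top bit of a two's-complement value carries negative weight.
        coeff = -(1 << plane) if plane == 3 else (1 << plane)
        total += coeff * lut[pattern]
    return total


acts = [0.5, -1.25, 2.0, 0.75]
lut = build_lut(acts)                         # built once per activation group
rows = [[3, -8, 7, -1], [1, 0, -2, 5]]        # hypothetical INT4 weight rows
outputs = [lut_dot_int4(lut, row) for row in rows]
direct = [sum(a * w for a, w in zip(acts, row)) for row in rows]
```

The table is built once per activation group and reused for every weight row, so the per-row work is lookups, shifts, and adds rather than multiplications. That reuse is what makes the approach attractive in hardware.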

RLB Dataflow Matches Prefill and Decode Phases

LLM inference has two very different phases: prefill (compute-heavy, dominated by partial-sum accumulation) and decode (memory-bound, dependent on weight reuse). Static dataflows force a compromise between them. MxGLUT's reconfigurable LUT-centric broadcast (RLB) dataflow localizes heavy partial-sum accumulation during prefill and exploits weight reuse during decode, switching between the two on the same hardware.
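A rough roofline-style estimate shows why the two phases behave so differently. The sketch below does not model the RLB dataflow; it only counts operations and bytes for one linear layer. The 4096×4096 layer shape, the token counts, and the assumption that outputs are also written as 8-bit values are all hypothetical choices for illustration.

```python
def arithmetic_intensity(tokens, k, n, weight_bits=4, act_bits=8):
    """FLOPs per byte moved for a [tokens x k] @ [k x n] GEMM.

    Assumes each weight, input activation, and output is moved exactly once.
    """
    flops = 2 * tokens * k * n                      # one multiply and one add per MAC
    weight_bytes = k * n * weight_bits / 8
    input_bytes = tokens * k * act_bits / 8
    output_bytes = tokens * n * act_bits / 8
    return flops / (weight_bytes + input_bytes + output_bytes)


K = N = 4096                                        # hypothetical layer shape
prefill = arithmetic_intensity(tokens=2048, k=K, n=N)   # whole prompt at once
decode = arithmetic_intensity(tokens=1, k=K, n=N)       # one new token per step
```

With a single token, every weight fetched is used for just one multiply-accumulate, so decode is limited by memory traffic. With thousands of tokens, the same weights are reused many times and the layer becomes compute-bound.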

Reported Results: 2.16× Prefill Speedup at up to 1.70% Perplexity Cost

Across the Llama family, MxGLUT delivers up to 2.16× prefill latency speedup and 1.49× decode speedup. Normalized energy drops to 0.44× (prefill) and 0.71× (decode). The accuracy impact is minimal: at most 1.70% perplexity increase. At the accelerator level, area efficiency hits 0.492 TFLOPS/mm² and energy efficiency 11.58 TFLOPS/W.
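The speedup and normalized-energy figures are easier to compare once converted to relative latency and energy savings. The helper names below are hypothetical, and the inputs are the reported best-case ("up to") values, so the results are best cases as well.

```python
def relative_latency(speedup):
    """Latency as a fraction of the baseline, given a speedup factor."""
    return 1.0 / speedup


def energy_saving(normalized_energy):
    """Fraction of baseline energy saved, given energy normalized to the baseline."""
    return 1.0 - normalized_energy


# (speedup, normalized energy) as reported for each phase
reported = {"prefill": (2.16, 0.44), "decode": (1.49, 0.71)}

for phase, (speedup, energy) in reported.items():
    print(f"{phase}: latency {relative_latency(speedup):.1%} of baseline, "
          f"energy saving {energy_saving(energy):.0%}")
```

In other words, the best-case prefill run takes a bit under half the baseline time while using a bit under half the energy; decode gains are smaller on both axes.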

MxGLUT shows a path to unified LUT-based accelerators that scale across quantization schemes without bespoke FP datapaths, which could make the architecture relevant to next-generation edge or on-chip LLM accelerators.


Source: MxGLUT: A Reconfigurable LUT-Centric Broadcast Dataflow Accelerator for Mixed-Precision GEMM
Domain: arxiv.org
