TileFuse Fuses Quantization Kernels to Double AMD NPU LLM Throughput

Up to 281% faster GEMV than full-precision baselines: that is the headline number from TileFuse, a new kernel library that gets AWQ-style quantized LLM inference running efficiently on AMD’s XDNA2 NPUs.

Why NPUs Stumble on Standard Quantization

Client NPUs like XDNA2 pack AI Engine arrays (4x8 in this case) and promise high throughput under tight power budgets. Problem is, off-the-shelf quantization formats like AWQ (W4A16, W8A16) don’t map cleanly onto the proprietary, limited-control software stacks these NPUs expose. Most deployments end up reshaping the model to match the NPU’s native quantization scheme, which kills portability and often leaves performance on the table.
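To make the W4A16 label concrete: weights are stored as 4-bit integer codes with per-group scale and zero-point metadata, while activations stay in 16-bit floating point. The sketch below is a minimal pure-Python illustration of that storage idea, not AWQ's or TileFuse's actual layout. The function names, the low-nibble-first packing order, and the rounding scheme are all assumptions made for illustration.

```python
# Illustrative sketch of 4-bit group quantization (the "W4" in W4A16).
# All names and the nibble order are assumptions, not the AWQ/TileFuse layout.

def quantize_group(weights, bits=4):
    """Quantize one group of floats to unsigned integer codes.

    Returns (codes, scale, zero) such that w ~= (code - zero) * scale.
    The zero point is kept as a plain Python int for simplicity.
    """
    qmax = (1 << bits) - 1                 # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0        # constant group: fall back to scale 1.0
    zero = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero

def pack_int4(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first (assumed order)."""
    assert len(codes) % 2 == 0, "need an even number of codes"
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_int4(packed):
    """Inverse of pack_int4: split each byte back into two 4-bit codes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out

# One group of eight weights: 8 floats become 4 bytes plus one scale and zero point.
weights = [-0.8, -0.3, 0.0, 0.1, 0.4, 0.7, -0.5, 0.2]
codes, scale, zero = quantize_group(weights)
packed = pack_int4(codes)
restored = [(c - zero) * scale for c in unpack_int4(packed)]
```

The unpack and dequantize steps here are exactly the work an NPU has to do before it can multiply, which is why a stack that only understands its own native format cannot consume such weights directly.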

TileFuse, built by a team at AMD Research, takes the opposite approach: bend the NPU to the model.

Fusing Unpack, Dequant, and GEMM into One Dataflow

Instead of chaining separate unpack, dequant, and GEMM kernels—with all the intermediate memory traffic that implies—TileFuse fuses them into a single kernel. It co-designs weight layout, metadata placement, and array-level dataflow for XDNA2’s AI Engine array. An interleaved pre-tiling layout supports GEMM dimensions up to 32K. The GEMV dataflow is redesigned to saturate the full 4x8 array, not just a subset.
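The difference between chained and fused kernels can be sketched in plain Python. This is a scalar illustration of the general idea only: it does not model XDNA2's AI Engine array, the interleaved pre-tiling layout, or TileFuse's real kernels. The function names, the low-nibble-first packing, and the per-row, per-group `scales` and `zeros` lists are assumptions.

```python
# Chained vs. fused W4A16 GEMV, as a scalar sketch (hypothetical names throughout).

def gemv_unfused(packed_rows, scales, zeros, x, group_size):
    """Three separate stages, each materializing a full intermediate."""
    # Stage 1: unpack every byte into two 4-bit codes.
    codes = [[n for b in row for n in (b & 0x0F, b >> 4)] for row in packed_rows]
    # Stage 2: dequantize all codes to floats using per-group metadata.
    w = [[(c - zeros[r][j // group_size]) * scales[r][j // group_size]
          for j, c in enumerate(row)] for r, row in enumerate(codes)]
    # Stage 3: ordinary matrix-vector product on the float weights.
    return [sum(wj * xj for wj, xj in zip(row, x)) for row in w]

def gemv_fused(packed_rows, scales, zeros, x, group_size):
    """Unpack, dequantize, and accumulate in one pass; no intermediates stored."""
    y = []
    for r, row in enumerate(packed_rows):
        acc = 0.0
        for g in range(len(x) // group_size):
            s, z = scales[r][g], zeros[r][g]   # metadata read once per group
            base = g * group_size
            for j in range(0, group_size, 2):  # each byte holds two weights
                b = row[(base + j) // 2]
                acc += ((b & 0x0F) - z) * s * x[base + j]
                acc += ((b >> 4) - z) * s * x[base + j + 1]
        y.append(acc)
    return y

# Two rows of four 4-bit weights each, one quantization group per row.
packed_rows = [bytes([0x21, 0x43]), bytes([0xFF, 0x00])]
scales = [[0.5], [0.1]]
zeros = [[2], [8]]
x = [1.0, 2.0, 3.0, 4.0]

y_unfused = gemv_unfused(packed_rows, scales, zeros, x, group_size=4)
y_fused = gemv_fused(packed_rows, scales, zeros, x, group_size=4)
```

Both functions compute the same result. The fused version never writes the unpacked codes or the dequantized weights back to memory, which is the intermediate traffic the fused kernel is designed to avoid.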

On kernel-level benchmarks, the fused approach hits up to 121.6% speedup for GEMM and 281% for GEMV versus full-precision baselines. Against strong iGPU baselines, TileFuse delivers more than 2x performance and energy-efficiency gains on GEMM.

Real Hardware Results: 2x Prefill, 64% Lower Energy

End-to-end LLM experiments on Ryzen AI laptops tell the real story. TileFuse achieves up to 2.0x lower prefilling latency while consuming more than 64.6% less energy. Those numbers are on actual client hardware, not a simulation.

The implication is direct: XDNA2 can be a practical target for AWQ-style edge LLM inference without forcing the model into a vendor-specific quantization straitjacket. TileFuse shows that native NPU support for off-the-shelf quantization is not only possible but also makes NPUs substantially more usable in real client deployments.

Expect similar fused-kernel approaches to appear for other vendor NPUs as the ecosystem realizes that the right way to handle quantized LLMs is to teach the hardware to speak the format, not the other way around.

Source: TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs
Domain: arxiv.org
