Chiplet-Contiguous Layout Slashes Remote HBM Traffic by 24x on LLM GEMM

A new memory layout cuts remote HBM traffic 24.7x on Qwen 3 and 19.2x on Llama 3.1 without OS or hardware changes

Remote HBM traffic on chiplet GPUs drops 24.7x on GEMM shapes from Qwen 3 30B and 19.2x on GEMM shapes from Llama 3.1 70B with a new memory layout called Chiplet-Contiguous Layout, and it requires no OS or hardware changes.

Multi-chiplet GPUs give you more compute and HBM capacity, but the non-uniform memory system punishes naive data placement. Locality-aware scheduling identifies which data should live near each chiplet, but that strategy runs head-on into page-granularity data interleaving. The optimal granularity for mapping data across chiplets varies wildly across matrix shapes, so a fixed interleave size forces a painful tradeoff.
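
To make the granularity problem concrete, here is a minimal sketch of how the "natural" placement unit shifts with matrix shape. It assumes a row-major matrix with 16-bit elements, split into equal column blocks across four chiplets. The widths, the chiplet count, and the helper name `owned_run_bytes` are illustrative assumptions, not details from the paper.

```python
PAGE_BYTES = 4096  # the 4 KB page size used as the baseline in the article


def owned_run_bytes(n_cols, num_chiplets, elem_bytes=2):
    """Bytes of one row that a single chiplet touches under a column-block split.

    Assumption: row-major storage, equal column blocks per chiplet.
    """
    return (n_cols // num_chiplets) * elem_bytes


# Hypothetical matrix widths, chosen only to show the spread.
for n_cols in (1024, 8192, 28672):
    run = owned_run_bytes(n_cols, num_chiplets=4)
    fits = "whole pages" if run % PAGE_BYTES == 0 else "partial pages"
    print(f"width={n_cols:6d}  run per chiplet={run:6d} B  -> {fits}")
```

Under these assumptions only one of the three widths yields per-chiplet runs that are a whole number of 4 KB pages. For the others, each page mixes data that different chiplets want, which is the tradeoff a fixed interleave size cannot avoid.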

A Layout That Doesn't Fight Pagination

Chiplet-Contiguous Layout stores chiplet-local data contiguously in global memory. That simple change makes locality-aware placement work with standard page-granularity interleaving: no new page sizes, no hypervisor modifications, no hardware rework. The authors evaluated it on representative GEMM shapes from two widely used open LLMs, Qwen 3 30B and Llama 3.1 70B, covering both inference and training workloads.
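
The idea can be illustrated with a toy model. This is a sketch under stated assumptions, not the authors' implementation: one row-major operand, 16-bit elements, four chiplets, and a column-block split in which each chiplet reads only its own block. The function names are hypothetical. For each layout, the model asks how much data is remote in the best case, where every page is placed on the chiplet that uses most of it.

```python
from collections import Counter

PAGE_BYTES = 4096    # page size named in the article
ELEM_BYTES = 2       # assumption: 16-bit elements
NUM_CHIPLETS = 4     # assumption: hypothetical chiplet count


def owner(col, n_cols):
    """Chiplet that reads this column under an equal column-block split."""
    return col // (n_cols // NUM_CHIPLETS)


def row_major_offset(row, col, n_rows, n_cols):
    """Byte offset of (row, col) in a conventional row-major layout."""
    return (row * n_cols + col) * ELEM_BYTES


def chiplet_contiguous_offset(row, col, n_rows, n_cols):
    """Byte offset when each chiplet's column block is stored back to back."""
    block_cols = n_cols // NUM_CHIPLETS
    chiplet, local_col = divmod(col, block_cols)
    block_elems = n_rows * block_cols
    return (chiplet * block_elems + row * block_cols + local_col) * ELEM_BYTES


def best_case_remote_fraction(offset_fn, n_rows, n_cols):
    """Remote share if each page sits on the chiplet that uses most of it."""
    users_per_page = {}
    for row in range(n_rows):
        for col in range(n_cols):
            page = offset_fn(row, col, n_rows, n_cols) // PAGE_BYTES
            users_per_page.setdefault(page, Counter())[owner(col, n_cols)] += 1
    total = n_rows * n_cols
    local = sum(max(users.values()) for users in users_per_page.values())
    return (total - local) / total


print(best_case_remote_fraction(row_major_offset, 64, 1024))           # 0.75
print(best_case_remote_fraction(chiplet_contiguous_offset, 64, 1024))  # 0.0
```

In this toy case every row-major page holds equal shares for all four chiplets, so three quarters of the data is remote no matter where the page lands. With the contiguous layout each page belongs to exactly one chiplet, so whole-page placement can make all of it local. Real GEMM tilings and shared operands are more involved than this model.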

Traffic Reduction That Tells the Story

Relative to naive 4 KB page interleaving, the layout cuts remote HBM traffic by 24.7x on Qwen and 19.2x on Llama. Even compared to a coarse locality-aware placement that groups data without the fine-grained contiguity, Chiplet-Contiguous Layout still delivers 4.1x and 2.1x reductions respectively. These figures come from GEMM shapes taken from the two LLMs rather than from synthetic matrix sizes.
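
A quick back-of-the-envelope reading of those ratios, as a sketch: it assumes both reduction factors for a model are measured on the same workload, so they can be divided to estimate how the coarse placement compares with naive interleaving. That derived figure is an inference from the numbers above, not a result reported in the article.

```python
# Reduction factors quoted in the article: (vs naive 4 KB interleave, vs coarse placement)
reductions = {
    "Qwen 3 30B": (24.7, 4.1),
    "Llama 3.1 70B": (19.2, 2.1),
}

summary = {}
for model, (vs_naive, vs_coarse) in reductions.items():
    remaining_pct = round(100 / vs_naive, 1)          # remote traffic left, % of naive
    coarse_vs_naive = round(vs_naive / vs_coarse, 1)  # implied, assuming a shared workload
    summary[model] = (remaining_pct, coarse_vs_naive)
    print(f"{model}: {remaining_pct}% of naive remote traffic remains; "
          f"coarse placement implied at about {coarse_vs_naive}x better than naive")
```

Read this way, the layout leaves roughly 4 to 5 percent of the naive baseline's remote traffic, and the coarse placement already captures a sizeable share of the gain before contiguity is added.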

What This Changes for Chiplet GPU Architecture

Existing locality techniques either require expensive page migration or force coarse-grained allocation that wastes memory capacity. This approach keeps the simplicity of fixed page interleaving while achieving the locality benefits of custom mapping. For inference servers running LLMs across multi-chiplet GPUs, that means less cross-chiplet bandwidth pressure, lower energy, and fewer stalls waiting on remote memory.

Chiplet-Contiguous Layout could make chiplet GPU designs cheaper, since they would not need exotic page-table support: the operating system and hardware stay untouched, and the gain comes from how data is laid out in memory.


Source: Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs
Domain: arxiv.org