A new memory layout called Chiplet-Contiguous Layout cuts remote HBM traffic on chiplet GPUs by 24.7x on GEMM shapes from Qwen 3 30B and by 19.2x on shapes from Llama 3.1 70B, with no OS or hardware changes required.
Multi-chiplet GPUs give you more compute and HBM capacity, but the non-uniform memory system punishes naive data placement. Locality-aware scheduling identifies which data should live near each chiplet, but that strategy runs head-on into page-granularity data interleaving. The optimal granularity for mapping data across chiplets varies wildly across matrix shapes, so a fixed interleave size forces a painful tradeoff.
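To make the conflict concrete, here is a minimal sketch of how much of a matrix ends up on the wrong chiplet when a row-major layout meets page interleaving. Everything in it is an illustrative assumption rather than the paper's setup: four chiplets, 4 KB pages assigned round-robin to chiplets, 2-byte elements, and a scheduler that gives each chiplet one contiguous block of columns. The function name `remote_fraction` is hypothetical.

```python
def remote_fraction(rows, cols, num_chiplets=4, page_bytes=4096, elem_bytes=2):
    """Fraction of elements whose home chiplet differs from the chiplet that uses them.

    Assumptions (illustrative only): the matrix is stored row-major, pages are
    interleaved round-robin across chiplets, and chiplet i consumes the i-th
    contiguous block of columns.
    """
    cols_per_chiplet = cols // num_chiplets
    remote = 0
    for r in range(rows):
        for c in range(cols):
            owner = c // cols_per_chiplet                 # chiplet that reads this element
            page = (r * cols + c) * elem_bytes // page_bytes
            home = page % num_chiplets                    # chiplet whose HBM holds the page
            if owner != home:
                remote += 1
    return remote / (rows * cols)


if __name__ == "__main__":
    # An 8 x 4096 fp16 matrix: each row spans exactly two 4 KB pages.
    print(f"remote fraction: {remote_fraction(8, 4096):.2f}")
```

Under these assumptions the page boundaries follow the row-major address order, not the column blocks the scheduler cares about, so most of what each chiplet reads lives in another chiplet's HBM.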
A Layout That Doesn't Fight Pagination
Chiplet-Contiguous Layout stores chiplet-local data contiguously in global memory. That simple change makes locality-aware placement work with standard page-granularity interleaving: no new page sizes, no hypervisor modifications, no hardware rework. The authors evaluated it on representative GEMM shapes from two leading open LLM families, Qwen 3 30B and Llama 3.1 70B, covering both inference and training workloads.
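The idea of storing each chiplet's data contiguously can be sketched with a toy address mapping. This is not the paper's actual layout: the column-block partitioning, the 4 KB page size, the 2-byte element size, and the helper names `row_major_offset`, `chiplet_contiguous_offset`, and `mixed_pages` are all assumptions made for illustration. The sketch counts how many pages contain data belonging to more than one chiplet, since such pages cannot be local to every chiplet that needs them.

```python
def row_major_offset(r, c, rows, cols, num_chiplets):
    """Element offset in a conventional row-major layout."""
    return r * cols + c


def chiplet_contiguous_offset(r, c, rows, cols, num_chiplets):
    """Element offset when each chiplet's column block is packed back to back.

    Hypothetical mapping: chiplet i owns the i-th block of columns, and its
    whole block is stored as one contiguous region.
    """
    cols_per_chiplet = cols // num_chiplets
    owner = c // cols_per_chiplet
    local_col = c % cols_per_chiplet
    return owner * rows * cols_per_chiplet + r * cols_per_chiplet + local_col


def mixed_pages(offset_fn, rows, cols, num_chiplets=4, page_bytes=4096, elem_bytes=2):
    """Return (pages holding data for more than one chiplet, total pages)."""
    cols_per_chiplet = cols // num_chiplets
    owners_by_page = {}
    for r in range(rows):
        for c in range(cols):
            page = offset_fn(r, c, rows, cols, num_chiplets) * elem_bytes // page_bytes
            owners_by_page.setdefault(page, set()).add(c // cols_per_chiplet)
    mixed = sum(1 for owners in owners_by_page.values() if len(owners) > 1)
    return mixed, len(owners_by_page)


if __name__ == "__main__":
    for name, fn in [("row-major", row_major_offset),
                     ("chiplet-contiguous", chiplet_contiguous_offset)]:
        mixed, total = mixed_pages(fn, rows=8, cols=4096)
        print(f"{name}: {mixed} of {total} pages mix data from several chiplets")
```

For this 8 x 4096 example, every page in the row-major layout mixes data from two chiplets, while the packed layout leaves each page with a single owner, which is the property a page-granularity placement policy needs in order to keep that page local.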
Traffic Reduction That Tells the Story
Relative to naive 4 KB page interleaving, the layout cuts remote HBM traffic by 24.7x on Qwen and 19.2x on Llama. Even compared to a coarse locality-aware placement that groups data without the fine-grained contiguity, Chiplet-Contiguous Layout still delivers 4.1x and 2.1x reductions, respectively. These are not synthetic benchmarks: the GEMM shapes are taken from production-scale LLMs.
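The two sets of figures can be combined to estimate how far the coarse locality-aware placement gets on its own. This is a back-of-the-envelope derivation, not a number reported in the source, and it assumes both reductions are ratios of the same remote-traffic metric over the same set of shapes. The function name `implied_coarse_gain` is hypothetical.

```python
# Reduction factors quoted in the article.
reported = {
    "Qwen 3 30B": {"vs_naive": 24.7, "vs_coarse": 4.1},
    "Llama 3.1 70B": {"vs_naive": 19.2, "vs_coarse": 2.1},
}


def implied_coarse_gain(vs_naive, vs_coarse):
    """Estimated reduction of coarse placement over naive interleaving.

    If naive/CCL = vs_naive and coarse/CCL = vs_coarse, then
    naive/coarse = vs_naive / vs_coarse (assuming a shared traffic metric).
    """
    return vs_naive / vs_coarse


if __name__ == "__main__":
    for model, r in reported.items():
        gain = implied_coarse_gain(r["vs_naive"], r["vs_coarse"])
        print(f"{model}: coarse placement alone is roughly {gain:.1f}x better than naive")
```

Under that assumption, coarse placement would account for roughly 6.0x of the improvement on Qwen and 9.1x on Llama, with the contiguous layout supplying the remaining factor.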
What This Changes for Chiplet GPU Architecture
Existing locality techniques either require expensive page migration or force coarse-grained allocation that wastes memory capacity. This approach keeps the simplicity of fixed page interleaving while achieving the locality benefits of custom mapping. For inference servers running LLMs across multi-chiplet GPUs, that means less cross-chiplet bandwidth pressure, lower energy, and fewer stalls waiting on remote memory.
Chiplet-Contiguous Layout could also make chiplet GPU designs cheaper, since they would not need exotic page-table support: the operating system and hardware stay untouched, and the performance win comes purely from the memory allocator.
Source: Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs
Domain: arxiv.org