
90x Remote Traffic Swing Shows Why CTA Ordering Matters on Multi-Chiplet GPUs

A tile-level locality simulator shows that remote HBM traffic for LLM GEMMs varies by up to 90x across design choices, and that 2D block swizzling cuts it by up to 5.1x relative to the best 1D traversal.

multi-chiplet GPUs, GEMM, LLM, CTA traversal, 2D block swizzle, locality simulator

For the same GEMM dimensions on a multi-chiplet GPU, remote HBM traffic can swing by up to 90x depending on operand layout, CTA traversal order, and data placement. A new functional simulator gives designers a concrete way to explore that space.

Why Multi-Chiplet GPUs Force a New GEMM Optimization Knob

Multi-chiplet GPUs split memory into local and remote HBM regions across a silicon interposer. Every remote access burns extra energy and consumes inter-chiplet bandwidth. For GEMM — the dominant operator in large language models — the resulting inter-chiplet traffic depends strongly on kernel choices: operand layout, CTA traversal order, and data placement. The optimal strategy to minimize remote accesses isn't obvious, and brute-force simulation of full-size LLM GEMM configurations is too slow.
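To make the local/remote distinction concrete, here is a minimal sketch, not taken from the paper, of how a tile read might be classified under round-robin placement. The function names and the tile-granular placement are assumptions made for illustration only.

```python
def home_chiplet(tile_index: int, num_chiplets: int) -> int:
    """Hypothetical round-robin placement: tile t lives in the HBM of chiplet t % num_chiplets."""
    return tile_index % num_chiplets


def remote_fraction(tiles_read, cta_chiplet: int, num_chiplets: int) -> float:
    """Fraction of tile reads that a CTA on `cta_chiplet` must fetch from another chiplet's HBM."""
    tiles = list(tiles_read)
    remote = sum(1 for t in tiles if home_chiplet(t, num_chiplets) != cta_chiplet)
    return remote / len(tiles)


if __name__ == "__main__":
    # A CTA on chiplet 0 streaming eight consecutive tiles across four chiplets.
    print(remote_fraction(range(8), cta_chiplet=0, num_chiplets=4))
```

With data spread evenly over four chiplets and no reuse, three out of four reads are remote in this toy setup, which is why caching and the order in which CTAs touch tiles matter.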

90x Variation in Remote Traffic — and a 5.1x Improvement from 2D Swizzle

The authors present a fast, functional, tile-level locality simulator that models CTA scheduling, per-chiplet L2 caches, and local/remote HBM accesses. Across representative LLM GEMMs, the simulator shows remote traffic varies by up to 90x within the same GEMM dimensions, so these kernel choices define a large design space rather than a minor tuning detail. Using the simulator as feedback, an agentic AI discovered a 2D block-swizzle CTA traversal that reduces remote traffic by up to 5.1x relative to the best 1D traversal under round-robin placement.
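The idea of comparing traversal orders at tile granularity can be illustrated with a toy model. The sketch below is not the authors' simulator. It assumes each CTA (i, j) reads one A row panel i and one B column panel j, that the traversal order is split into contiguous chunks with one chunk per chiplet, that each chiplet's L2 is idealized and never evicts, and that panels are placed round-robin by index. All function names are hypothetical.

```python
def row_major_order(tiles_m, tiles_n):
    """1D traversal: walk the output-tile grid row by row."""
    return [(i, j) for i in range(tiles_m) for j in range(tiles_n)]


def block_swizzle_order(tiles_m, tiles_n, block_m, block_n):
    """2D block traversal: finish one block_m x block_n group of tiles before the next."""
    order = []
    for bi in range(0, tiles_m, block_m):
        for bj in range(0, tiles_n, block_n):
            for i in range(bi, min(bi + block_m, tiles_m)):
                for j in range(bj, min(bj + block_n, tiles_n)):
                    order.append((i, j))
    return order


def remote_panel_fetches(order, num_chiplets):
    """Count panel fetches served by another chiplet's HBM under the toy assumptions."""
    chunk = -(-len(order) // num_chiplets)  # ceiling division: CTAs per chiplet
    remote = 0
    for chiplet in range(num_chiplets):
        cached = set()  # idealized per-chiplet L2 (assumption: no evictions)
        for i, j in order[chiplet * chunk:(chiplet + 1) * chunk]:
            for panel in (("A", i), ("B", j)):
                if panel in cached:
                    continue  # L2 hit: no HBM access at all
                cached.add(panel)
                home = panel[1] % num_chiplets  # assumed round-robin placement
                if home != chiplet:
                    remote += 1
    return remote


if __name__ == "__main__":
    one_d = row_major_order(8, 8)
    two_d = block_swizzle_order(8, 8, 4, 4)
    print("1D row-major remote fetches:", remote_panel_fetches(one_d, 4))
    print("2D block remote fetches:   ", remote_panel_fetches(two_d, 4))
```

In this toy configuration the 2D order needs fewer remote fetches because each chiplet touches a compact square of the output grid and therefore fewer distinct panels. The size of the gap here says nothing about the paper's measured 5.1x, which comes from a far more detailed model and real LLM GEMM shapes.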

Turning CTA Traversal Order into a First-Order Design Knob

CTA traversal order has rarely been treated as a primary parameter in GEMM tuning discussions. A reduction of up to 5.1x from changing only the traversal pattern challenges that view. The simulator runs fast enough to evaluate full-size LLM GEMM configurations, making it practical to integrate into automated design-space exploration. The paper identifies CTA traversal order as a GEMM-dependent, first-order knob for inter-chiplet traffic on multi-chiplet GPUs, which gives hardware architects a concrete lever for conserving inter-chiplet bandwidth in LLM training and inference.


Source: A Fast Locality Simulator for GEMM Design-Space Exploration on Multi-Chiplet GPUs
Domain: arxiv.org
