90x Remote Traffic Swing Shows Why CTA Ordering Matters on Multi-Chiplet GPUs

For the same GEMM dimensions on a multi-chiplet GPU, remote HBM traffic can swing by up to 90x depending on operand layout, CTA traversal order, and data placement. A new functional simulator gives designers a concrete way to explore that space.

Why Multi-Chiplet GPUs Force a New GEMM Optimization Knob

Multi-chiplet GPUs split memory into local and remote HBM regions across a silicon interposer: an access is local when the data sits in the requesting chiplet's own HBM and remote when it sits in another chiplet's. Every remote access burns extra energy and consumes inter-chiplet bandwidth. For GEMM (general matrix multiplication), the dominant operator in large language models, the resulting inter-chiplet traffic depends strongly on kernel choices: operand layout, CTA (cooperative thread array, i.e. thread block) traversal order, and data placement. The optimal strategy to minimize remote accesses isn't obvious, and brute-force simulation of full-size LLM GEMM configurations is too slow.
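To make the local/remote distinction concrete, here is a minimal sketch of round-robin placement. It assumes tiles are interleaved across chiplets at tile granularity and that there is no caching at all; the function names are hypothetical and are not taken from the paper.

```python
def home_chiplet(tile_index: int, num_chiplets: int) -> int:
    """Round-robin placement (assumed): tile i lives in the HBM of chiplet i mod N."""
    return tile_index % num_chiplets


def remote_fraction(tile_indices, cta_chiplet: int, num_chiplets: int) -> float:
    """Fraction of the given tile reads that land in another chiplet's HBM."""
    tiles = list(tile_indices)
    remote = sum(1 for t in tiles if home_chiplet(t, num_chiplets) != cta_chiplet)
    return remote / len(tiles)


# A CTA on chiplet 0 streaming 16 consecutive tiles on a 4-chiplet GPU.
print(remote_fraction(range(16), cta_chiplet=0, num_chiplets=4))  # prints 0.75
```

With N chiplets and no reuse, (N - 1)/N of the reads are remote in this sketch, which is why the amount of reuse captured in each chiplet's L2 cache, and therefore the order in which CTAs run, ends up mattering.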

90x Variation in Remote Traffic — and a 5.1x Improvement from 2D Swizzle

The paper presents a fast, functional, tile-level locality simulator that models CTA scheduling, per-chiplet L2 caches, and local/remote HBM accesses. Across representative LLM GEMMs, the simulator shows remote traffic varying by up to 90x within the same GEMM dimensions. That's not a rounding error; it's a design space the size of a canyon. Using the simulator as feedback, an agentic AI discovered that a 2D block-swizzle CTA traversal cuts remote traffic by up to 5.1x relative to the best 1D traversal under round-robin placement.
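The intuition behind the traversal effect can be sketched with a toy model. This is not the authors' simulator. It assumes that each group of consecutive CTAs in traversal order runs on one chiplet, that the group shares that chiplet's L2 perfectly, that the cache starts empty for every group, and that operand tiles are placed round-robin by tile id. All names and sizes are hypothetical.

```python
def row_major(m_tiles, n_tiles):
    """1D traversal: walk the output tiles row by row."""
    return [(m, n) for m in range(m_tiles) for n in range(n_tiles)]


def block_swizzle_2d(m_tiles, n_tiles, block):
    """2D traversal: visit block x block groups of output tiles, one group at a time."""
    order = []
    for m0 in range(0, m_tiles, block):
        for n0 in range(0, n_tiles, block):
            for m in range(m0, min(m0 + block, m_tiles)):
                for n in range(n0, min(n0 + block, n_tiles)):
                    order.append((m, n))
    return order


def remote_tile_fetches(order, m_tiles, n_tiles, k_tiles, num_chiplets, ctas_per_group):
    """Count operand-tile fetches that cross chiplets in this toy model."""
    remote = 0
    b_base = m_tiles * k_tiles  # B tile ids are numbered after the A tile ids
    for start in range(0, len(order), ctas_per_group):
        # Assumed CTA-to-chiplet mapping: groups are dealt out round-robin.
        chiplet = (start // ctas_per_group) % num_chiplets
        touched = set()  # distinct tiles: each is fetched once per group
        for m, n in order[start:start + ctas_per_group]:
            for k in range(k_tiles):
                touched.add(m * k_tiles + k)           # A tile (m, k)
                touched.add(b_base + k * n_tiles + n)  # B tile (k, n)
        # Round-robin placement: tile id modulo chiplet count is the home HBM.
        remote += sum(1 for t in touched if t % num_chiplets != chiplet)
    return remote


M_TILES, N_TILES, K_TILES = 16, 16, 4
one_d = remote_tile_fetches(row_major(M_TILES, N_TILES), M_TILES, N_TILES, K_TILES, 4, 16)
two_d = remote_tile_fetches(block_swizzle_2d(M_TILES, N_TILES, 4), M_TILES, N_TILES, K_TILES, 4, 16)
print(one_d, two_d, one_d / two_d)  # prints 816 384 2.125
```

In this toy setup a 1D group touches one row of A and sixteen columns of B, while a 4x4 group touches four of each, so fewer distinct tiles have to be pulled in. The 2.125x ratio is an artifact of these made-up sizes and says nothing about the paper's measured 5.1x.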

Turning CTA Traversal Order into a First-Order Design Knob

CTA traversal order has been a second-class citizen in most GEMM tuning discussions. A 5.1x reduction from changing only the traversal pattern argues otherwise. The simulator runs fast enough to evaluate full-size LLM GEMM configurations, which makes it practical to integrate into automated design-space exploration. The paper identifies CTA traversal order as a GEMM-dependent, first-order knob for inter-chiplet traffic on multi-chiplet GPUs: exactly the kind of concrete lever hardware architects need when squeezing every last drop of bandwidth for LLM training and inference.
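One way to picture "GEMM-dependent" is a tiny search over block shapes. The sketch below uses a hypothetical cost, the number of A row-panels plus B column-panels touched by one group of CTAs, as a stand-in for a real simulator's feedback. The function name and the cost model are assumptions for illustration only, not the paper's method.

```python
def best_block_shape(m_tiles, n_tiles, ctas_per_group):
    """Return (cost, block_m, block_n) with the fewest operand panels touched.

    Only exact factorizations of ctas_per_group that fit inside the
    output-tile grid are considered; returns None if none fits.
    """
    best = None
    for block_m in range(1, ctas_per_group + 1):
        if ctas_per_group % block_m:
            continue
        block_n = ctas_per_group // block_m
        if block_m > m_tiles or block_n > n_tiles:
            continue
        cost = block_m + block_n  # A row-panels + B column-panels per group
        if best is None or cost < best[0]:
            best = (cost, block_m, block_n)
    return best


print(best_block_shape(16, 16, 16))  # square GEMM: prints (8, 4, 4)
print(best_block_shape(2, 64, 16))   # skinny GEMM: prints (10, 2, 8)
```

Under this stand-in cost, a square output grid favors a square block, while a GEMM with only two rows of output tiles is forced toward a wide block, so the preferred traversal shifts with the GEMM shape.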

Source: A Fast Locality Simulator for GEMM Design-Space Exploration on Multi-Chiplet GPUs
Domain: arxiv.org
