
ASTRA-sim 3.0 Models Distributed ML at Cache-Line Load-Store Granularity

Simulating each GPU's loads and stores at cache-line size balances scalability with fidelity and reveals new tradeoffs in the design of collective algorithms, networks, and GPU architectures.

astra sim, distributed machine learning, gpu simulation, infra graph, collective communication, high fidelity simulation

ASTRA-sim 3.0 simulates distributed machine learning at cache-line-sized load-store granularity, the same granularity at which a real GPU issues memory operations while handling collective communication. Tracking every 64-byte chunk from the L1 cache to the network gives simulation of latency-sensitive model inference a level of detail that coarser models lack.

Why Cache-Line Granularity Matters for Distributed Training

Prior simulators abstracted away the memory hierarchy and treated GPU memory accesses as uniform. That works for throughput-bound training, but inference and latency-sensitive collectives live or die on the exact ordering and timing of small transfers. ASTRA-sim 3.0 couples a detailed GPU execution model with this fine-grained memory simulation, capturing stalls, bandwidth contention, and control-flow bottlenecks that higher-level models miss. The authors claim this balances simulation scalability with fidelity: you can simulate a full cluster without waiting a month.
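To make the granularity concrete, here is a minimal sketch of how one buffer copy breaks down into per-cache-line load and store operations. It is an illustration only, not ASTRA-sim's internal representation: the `MemOp` type, the `cache_line_ops` helper, and the 64-byte line size (taken from the chunk size mentioned above) are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterator

CACHE_LINE_BYTES = 64  # assumption: the 64-byte chunk size mentioned in the article


@dataclass(frozen=True)
class MemOp:
    """Hypothetical record for one cache-line-sized memory operation."""
    kind: str     # "load" or "store"
    address: int  # byte address, aligned to a cache line
    size: int     # bytes moved by this operation


def cache_line_ops(src: int, dst: int, nbytes: int,
                   line: int = CACHE_LINE_BYTES) -> Iterator[MemOp]:
    """Yield one load and one store per cache line needed to copy nbytes."""
    if src % line or dst % line:
        raise ValueError("this sketch assumes line-aligned buffers")
    n_lines = -(-nbytes // line)  # ceiling division
    for i in range(n_lines):
        yield MemOp("load", src + i * line, line)
        yield MemOp("store", dst + i * line, line)


# A 1000-byte transfer touches 16 cache lines, so it becomes 32 operations.
ops = list(cache_line_ops(src=0x1000, dst=0x8000, nbytes=1000))
print(len(ops), ops[0], ops[-1])
```

A bandwidth-only model would treat the same copy as a single 1000-byte event. Modeling each operation separately is what lets a simulator capture ordering and contention between small transfers.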

InfraGraph: A Common Language for Network Infrastructure

The paper introduces InfraGraph, a standardized representation for describing network topology, link capacities, routing, and interconnect delays. Instead of each research group inventing its own JSON or YAML schema, InfraGraph aims to be the shared vocabulary that lets collective algorithm designers compare apples to apples. ASTRA-sim 3.0 uses InfraGraph to automatically derive feasible communication schedules.
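The article does not show the InfraGraph schema, so the following is not InfraGraph itself. It is a minimal sketch of the kind of information such a representation carries: nodes, links with a capacity and a delay, and a routing query over them. The `Topology` class, its field names, and the example device names and numbers are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Hypothetical topology graph; NOT the actual InfraGraph schema."""
    # node -> list of (neighbor, bandwidth in GB/s, latency in ns)
    links: dict = field(default_factory=dict)

    def add_link(self, a, b, bandwidth_gb_per_s, latency_ns):
        # Links are modeled as bidirectional with symmetric properties.
        self.links.setdefault(a, []).append((b, bandwidth_gb_per_s, latency_ns))
        self.links.setdefault(b, []).append((a, bandwidth_gb_per_s, latency_ns))

    def min_latency_path(self, src, dst):
        """Dijkstra over link latency; returns (total_latency_ns, path)."""
        best = {src: 0.0}
        prev = {}
        heap = [(0.0, src)]
        while heap:
            dist, node = heapq.heappop(heap)
            if node == dst:
                break
            if dist > best.get(node, float("inf")):
                continue  # stale heap entry
            for neighbor, _bandwidth, latency in self.links.get(node, []):
                candidate = dist + latency
                if candidate < best.get(neighbor, float("inf")):
                    best[neighbor] = candidate
                    prev[neighbor] = node
                    heapq.heappush(heap, (candidate, neighbor))
        if dst not in best:
            raise ValueError(f"no route from {src} to {dst}")
        path = [dst]
        while path[-1] != src:
            path.append(prev[path[-1]])
        return best[dst], path[::-1]


topo = Topology()
topo.add_link("gpu0", "sw0", bandwidth_gb_per_s=100, latency_ns=500)
topo.add_link("gpu1", "sw0", bandwidth_gb_per_s=100, latency_ns=500)
topo.add_link("gpu0", "gpu1", bandwidth_gb_per_s=25, latency_ns=1500)

latency_ns, path = topo.min_latency_path("gpu0", "gpu1")
print(latency_ns, path)
```

Once topology, capacity, and delay live in one shared structure, a routing or scheduling query like this can be run unchanged against any described cluster. That is the comparability argument made above.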

What the Simulator Already Reveals

Early design space explorations using ASTRA-sim 3.0 demonstrate tradeoffs that earlier simulators missed: small changes in GPU L2 cache size shift the optimal all-reduce algorithm; network topology choices that look equal under bandwidth-only modeling show 20% latency differences when cache-line effects are included. The simulator is open-source and community-driven, so anyone building the next generation of distributed training infrastructure can reproduce and extend those results.
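A simple way to see why the best all-reduce algorithm can flip once latency is modeled is the textbook alpha-beta cost model, sketched below. This is not the model ASTRA-sim 3.0 uses, and the parameter values are illustrative assumptions. `alpha` is a fixed per-message latency in seconds, `beta` is the transfer time per byte, and reduction compute time is ignored.

```python
import math


def ring_allreduce_time(n_bytes, p, alpha, beta):
    """Ring all-reduce: 2*(p-1) steps, each moving n_bytes/p per GPU."""
    return 2 * (p - 1) * (alpha + (n_bytes / p) * beta)


def recursive_doubling_time(n_bytes, p, alpha, beta):
    """Recursive doubling: log2(p) steps, each exchanging the full buffer.

    Assumes p is a power of two.
    """
    return math.log2(p) * (alpha + n_bytes * beta)


def best_algorithm(n_bytes, p, alpha, beta):
    """Return the name of the cheaper algorithm under this cost model."""
    ring = ring_allreduce_time(n_bytes, p, alpha, beta)
    doubling = recursive_doubling_time(n_bytes, p, alpha, beta)
    return "ring" if ring <= doubling else "recursive_doubling"


P = 8                  # number of GPUs (illustrative)
BETA = 1 / 100e9       # seconds per byte at an assumed 100 GB/s
ALPHA = 2e-6           # assumed 2 microseconds of per-message latency

for n_bytes in (1024, 64 * 1024 * 1024):
    bandwidth_only = best_algorithm(n_bytes, P, alpha=0.0, beta=BETA)
    with_latency = best_algorithm(n_bytes, P, alpha=ALPHA, beta=BETA)
    print(n_bytes, bandwidth_only, with_latency)
```

With `alpha` set to zero, ring wins at every message size, because it moves the fewest bytes per GPU. With a nonzero `alpha`, the 1 KiB case favors recursive doubling, since ring pays the per-message latency 14 times instead of 3. A cache-line-level simulator captures effects far finer than a single `alpha` term, but the direction of the result is the same: choices that tie or rank one way under bandwidth-only modeling can rank differently once small-transfer timing is included.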

ASTRA-sim 3.0 turns the simulation of distributed ML from a coarse estimate into a tool that can guide real hardware and network decisions, one cache line at a time.


Source: ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling
Domain: arxiv.org
