StageFrontier Pinpoints Distributed Training Slowdowns at Under 0.2% Overhead

A new always-on signal for distributed ML training identifies the exact rank and stage causing a stall, using only coarse stage durations and no synchronized clocks, and replaces gigabytes of traces with a 0.11 MB summary.

stagefrontier, distributed training, pytorch, gloo, nccl, ml systems

Pinning a distributed training slowdown to the culprit rank and stage normally requires either a heavy profiler you can't leave on or gut-feel guesses from averages that lie. StageFrontier fixes that with an always-on mathematical accounting that adds under 0.2% throughput overhead across 128 ranks.

The Synchronization Lie

When a single rank stalls on data load, the synchronization barrier spreads that delay across all ranks. Standard dashboards that compute per-stage maxima count the same exposed delay twice, and those that compute per-stage averages bury the slow rank. Full profilers like PyTorch Profiler, HTA, and Nsight Systems see the truth clearly, but dumping their traces costs 15.81 GB of storage per run and enough CPU overhead to rule out always-on deployment.
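A small numerical sketch makes the double-counting concrete. The durations below are invented for illustration and are not from the source; the sketch also assumes that time spent waiting for the stalled rank is charged to the backward stage of the other ranks, as happens when the wait occurs inside a gradient all-reduce.

```python
# Hypothetical per-rank stage durations (seconds) for one training step.
# Rank 2 stalls for 4 s in data loading. Ranks 0 and 1 then wait for it
# during backward, so their backward time is inflated by the same 4 s.
durations = {
    0: {"data": 1.0, "forward": 2.0, "backward": 7.0},  # 3 s compute + 4 s wait
    1: {"data": 1.0, "forward": 2.0, "backward": 7.0},  # 3 s compute + 4 s wait
    2: {"data": 5.0, "forward": 2.0, "backward": 3.0},  # the stalled rank
}
stages = ["data", "forward", "backward"]

# Wall-clock time of the step: every rank finishes together at the barrier.
step_time = max(sum(d.values()) for d in durations.values())

# What a typical dashboard would show per stage.
per_stage_max = {s: max(d[s] for d in durations.values()) for s in stages}
per_stage_mean = {
    s: sum(d[s] for d in durations.values()) / len(durations) for s in stages
}

# Summing per-stage maxima counts the 4 s stall twice:
# once as rank 2's data time and once as the other ranks' backward wait.
sum_of_maxima = sum(per_stage_max.values())

print(f"step time:      {step_time:.1f} s")
print(f"sum of maxima:  {sum_of_maxima:.1f} s")
print(f"per-stage mean: {per_stage_mean}")
```

With these numbers the step takes 10 s, yet the per-stage maxima add up to 14 s, and the averages make backward look like the dominant stage even though the stall began in data loading on rank 2.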

Frontier Tracking Without Clock Sync

StageFrontier skips kernel tracing and synchronized clocks entirely. Each rank reports a short ordered vector of coarse stage durations (data, forward, backward), timed with a plain CPU wall clock. At each stage boundary, StageFrontier takes the largest cumulative time reached by any rank, that is, the rank furthest along in elapsed time. The increments of this frontier form an exact, additive accounting of the step's exposed time, pointing operators to the stage and rank where group-visible delay first appears. It does not guess which fix to make; it tells you where to aim a heavy profiler.
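The frontier idea can be sketched in a few lines. This is a minimal illustration based on the description above, not the authors' implementation: the function name `frontier_accounting`, the data layout, and the sample durations are all hypothetical, and ties between ranks are broken arbitrarily by taking the first rank found.

```python
from typing import Dict, List, Tuple


def frontier_accounting(
    durations: Dict[int, List[float]], stages: List[str]
) -> List[Tuple[str, float, int]]:
    """Return (stage, frontier increment, leading rank) per stage boundary.

    `durations[rank][i]` is the locally timed duration of stage i on that
    rank. Only durations are used, so no cross-rank clock sync is needed.
    """
    cumulative = {rank: 0.0 for rank in durations}
    previous_frontier = 0.0
    rows = []
    for i, stage in enumerate(stages):
        for rank, vec in durations.items():
            cumulative[rank] += vec[i]
        # The frontier is the largest cumulative time at this boundary.
        # max() returns the first such rank when several ranks tie.
        leader = max(cumulative, key=cumulative.get)
        frontier = cumulative[leader]
        rows.append((stage, frontier - previous_frontier, leader))
        previous_frontier = frontier
    return rows


# Hypothetical step: rank 2 stalls 4 s in data loading and the other
# ranks absorb that wait during backward.
stages = ["data", "forward", "backward"]
durations = {
    0: [1.0, 2.0, 7.0],
    1: [1.0, 2.0, 7.0],
    2: [5.0, 2.0, 3.0],
}

rows = frontier_accounting(durations, stages)
for stage, increment, leader in rows:
    print(f"{stage:<8} +{increment:.1f} s (leading rank {leader})")

# The increments telescope, so they sum to the slowest rank's total time.
total = sum(increment for _, increment, _ in rows)
```

In this made-up step the increments are 5 s, 2 s and 3 s, which sum to the 10 s step time, and the largest increment lands on the data stage with rank 2 leading. At the final boundary all three ranks tie, so the reported leader there carries no information.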

0.11 MB Instead of 15.81 GB

In a hidden-rank DDP test with injected faults across 50 rows, StageFrontier placed the real culprit among its top two suspects every time. It also recovered the same top-stage routing as PyTorch Profiler, HTA, and Nsight Systems, once their traces were reduced to the same coarse stages, and did so from a 0.11 MB summary. The authors implemented it in PyTorch on Gloo and NCCL and report that the overhead stays below 0.2% even at 128 ranks.

The coarse signal alone cannot always tell whether a leading stage caused the slowdown or merely ran alongside it. StageFrontier labels those windows explicitly, marking where more evidence is needed instead of pretending to know. That honesty makes it a drop-in addition to production training loops, not another dashboard that oversells its certainty.

StageFrontier won't replace your profiler. It tells you exactly where to drop one.


Source: StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training
Domain: arxiv.org
