EVPN-VXLAN with Queue-Pair Routing Tames Geo-Distributed AI Training

A new emulation framework shows that combining VXLAN overlays, EVPN, and queue-pair-aware traffic distribution can meet the synchronization demands of AllReduce and Parameter Server workloads.

Data sovereignty laws are forcing AI training workloads to span multiple data centers, and the synchronization overhead of AllReduce across 1000 km of fiber is brutal. The ScaleAcross paper (arXiv 2606.12963) proposes a practical infrastructure stack built on EVPN-VXLAN overlays that might actually make this work.

EVPN-VXLAN and the WAN Latency Problem

Cross-data-center AI training introduces three killers: synchronization-intensive communication, cross-site data exchange, and wide-area latency constraints. Standard Layer-2 extension techniques don't cut it for AllReduce or Parameter Server patterns. The authors pick EVPN-VXLAN as the foundation because it provides multi-tenancy and inter-data-center connectivity without sacrificing commodity hardware compatibility. They combine it with Equal-Cost Multi-Path (ECMP) routing and Bidirectional Forwarding Detection (BFD) for fast failover—table stakes for WAN resilience.
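To put a number on the latency constraint, here is a back-of-the-envelope sketch. It is not from the paper. It assumes light covers roughly 200 km of fiber per millisecond and that the job runs one flat ring AllReduce spanning both sites, so every ring step includes a WAN hop. The function names are made up for illustration.

```python
# Rough latency floor for a ring AllReduce stretched across a WAN.
# Assumptions (not from the paper): ~200 km of fiber per millisecond,
# and a flat ring that spans both sites, so each step waits on a WAN hop.

FIBER_KM_PER_MS = 200.0  # roughly two thirds of the speed of light in vacuum


def one_way_delay_ms(fiber_km: float) -> float:
    """Propagation delay only; ignores queuing, serialization, and detours."""
    return fiber_km / FIBER_KM_PER_MS


def ring_allreduce_latency_floor_ms(workers: int, fiber_km: float) -> float:
    """A ring AllReduce runs 2 * (workers - 1) sequential steps
    (reduce-scatter, then all-gather); each step waits at least
    one one-way delay on the slowest link in the ring."""
    steps = 2 * (workers - 1)
    return steps * one_way_delay_ms(fiber_km)


if __name__ == "__main__":
    print(one_way_delay_ms(1000))                    # 5.0
    print(ring_allreduce_latency_floor_ms(8, 1000))  # 70.0
```

Under these assumptions, eight workers on a 1000 km ring pay at least 70 ms of pure propagation delay per AllReduce, before a single byte of bandwidth cost is counted. Hierarchical schemes that reduce within each site first would cross the WAN far fewer times.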

Queue-Pair-Aware Routing for Synchronization Traffic

Here's the clever bit: a queue-pair-aware traffic distribution mechanism. Instead of blindly hashing flows across paths, this approach understands that synchronization operations (like AllReduce gradients) need tight coupling between matching send and receive queues. By routing both sides of a queue pair onto the same path, you reduce out-of-order delivery and head-of-line blocking. The framework implements this as a lightweight shim over standard VXLAN, so you don't need custom ASICs.
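The summary above doesn't spell out how the paper's shim picks a path, so treat the following as one possible illustration of the idea, not the authors' implementation. It builds a direction-independent key from the two queue-pair endpoints, so traffic from A to B and from B to A hashes to the same path index. `QueuePairFlow`, `symmetric_key`, and `pick_path` are hypothetical names.

```python
import hashlib
from typing import NamedTuple


class QueuePairFlow(NamedTuple):
    """One direction of a queue-pair conversation (hypothetical structure)."""
    src_ip: str
    dst_ip: str
    src_qp: int  # sender's queue pair number
    dst_qp: int  # receiver's queue pair number


def symmetric_key(flow: QueuePairFlow) -> bytes:
    """Order-independent key: A->B and B->A produce identical bytes."""
    a = (flow.src_ip, flow.src_qp)
    b = (flow.dst_ip, flow.dst_qp)
    lo, hi = sorted([a, b])
    return f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}".encode()


def pick_path(flow: QueuePairFlow, num_paths: int) -> int:
    """Map the queue pair to one of num_paths equal-cost paths."""
    digest = hashlib.sha256(symmetric_key(flow)).digest()
    return int.from_bytes(digest[:8], "big") % num_paths


if __name__ == "__main__":
    forward = QueuePairFlow("10.0.1.5", "10.0.2.9", 17, 42)
    reverse = QueuePairFlow("10.0.2.9", "10.0.1.5", 42, 17)
    # Both directions land on the same path index.
    print(pick_path(forward, 4) == pick_path(reverse, 4))  # True
```

Sorting the endpoints before hashing is what makes the choice symmetric. A plain 5-tuple hash has no such guarantee, because swapping source and destination changes the input.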

Emulation with ContainerLab and FRRouting

The authors built a reproducible emulation environment using ContainerLab and FRRouting (FRR) to run realistic WAN scenarios. They tested both AllReduce and Parameter Server communication patterns under varying latency and bandwidth constraints. Results cover traffic distribution behavior and infrastructure resilience—showing that the queue-pair-aware approach keeps synchronization overhead manageable even when cross-site links degrade. No synthetic benchmarks here; the WAN emulation uses real delay profiles.
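For intuition about what a latency and bandwidth sweep might show, here is a textbook alpha-beta cost sketch comparing the two patterns. It is a simplified analytical model with assumed parameters. It is not the paper's emulation and does not reproduce its results. The function names and the single-server Parameter Server layout are illustrative assumptions.

```python
# Alpha-beta cost sketch: time = latency term + bits-on-the-wire term.
# Function names, parameters, and the single-server layout are assumptions.

def ring_allreduce_s(workers: int, size_bytes: float,
                     latency_s: float, bandwidth_bps: float) -> float:
    """Flat ring: 2 * (workers - 1) steps, each moving a 1/workers chunk."""
    steps = 2 * (workers - 1)
    chunk_bits = 8 * size_bytes / workers
    return steps * (latency_s + chunk_bits / bandwidth_bps)


def parameter_server_s(workers: int, size_bytes: float,
                       latency_s: float, bandwidth_bps: float) -> float:
    """Single server: its link carries every worker's push, then every pull."""
    transfer_s = workers * 8 * size_bytes / bandwidth_bps
    return 2 * (latency_s + transfer_s)


if __name__ == "__main__":
    workers, size_bytes = 8, 100e6  # 8 workers, 100 MB of gradients
    for latency_ms in (1, 5, 20):
        for gbps in (10, 100):
            args = (workers, size_bytes, latency_ms / 1e3, gbps * 1e9)
            print(f"{latency_ms:>3} ms {gbps:>4} Gb/s  "
                  f"ring={ring_allreduce_s(*args):.3f}s  "
                  f"ps={parameter_server_s(*args):.3f}s")
```

In this model the ring pays the latency term once per step, while the Parameter Server pays it only twice but funnels all traffic through one link. That is why the two patterns respond differently as delay and bandwidth change.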

The emulation stack is fully open and reproducible. If you want to validate your own multi-DC training topology, the framework is the first systematic tool I've seen for this exact problem. Next step is validating against real hardware and multi-region cloud deployments—but the emulation results already point to a viable path for sovereign AI training at scale.


Source: ScaleAcross: Designing Multi-Data-Center Infrastructure for Geo-Distributed AI Training
Domain: arxiv.org
