
Strix Halo Cluster Hits 5μs RDMA Latency for vLLM Inference

Two-node AMD Strix Halo cluster using RoCE v2 achieves 5μs latency, enabling tensor-parallel inference on 128GB unified memory without a switch.

amd, strix halo, rdma, roce v2, vllm, distributed inference

Inter-node latency drops from 70–100µs to ~5µs when you link two AMD Strix Halo machines via RoCE v2, and that difference separates unusable from seamless distributed inference.

kyuz0’s setup guide on GitHub walks through building a two-node cluster using Framework Desktop mainboards with the AMD Ryzen AI MAX+ “Strix Halo” APU—128GB of unified memory per node, Intel E810 100GbE NICs, and a direct-attach DAC cable (no switch required). The whole thing runs Fedora 43 with in-kernel ice/irdma drivers and a custom Podman container that auto-detects RDMA devices.

From 100µs to 5µs: Why RoCE v2 Matters

Without RDMA, TCP/IP overhead adds 70–100µs of latency per message. With Tensor Parallelism (TP), the two nodes must exchange partial results after every layer of the neural network, thousands of times per second. At 100µs per exchange, that waiting adds up to a substantial stall on every generated token.
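To make the latency argument concrete, here is a minimal back-of-the-envelope sketch. The model shape (80 layers, two synchronisations per layer) is a hypothetical example and does not come from the guide; only the 100µs and 5µs figures do. The function counts latency alone and ignores the time spent actually moving data.

```python
def comm_stall_per_token_ms(num_layers: int, syncs_per_layer: int, latency_us: float) -> float:
    """Latency-only synchronisation cost for one generated token, in milliseconds."""
    return num_layers * syncs_per_layer * latency_us / 1000.0


# Hypothetical model shape (an assumption, not from the guide):
# 80 layers with 2 cross-node synchronisations per layer.
NUM_LAYERS = 80
SYNCS_PER_LAYER = 2

for label, latency_us in [("TCP/IP", 100.0), ("RoCE v2", 5.0)]:
    stall_ms = comm_stall_per_token_ms(NUM_LAYERS, SYNCS_PER_LAYER, latency_us)
    print(f"{label}: {stall_ms:.1f} ms of pure latency per token")
```

Under these assumptions the per-token latency cost shrinks by the same 20x factor as the per-message latency, which is why the per-message number matters so much for TP.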

RoCE v2 writes data directly from one node’s memory to the other’s, bypassing the CPU and kernel. Latency drops to ~5µs. Bandwidth hits ~50Gbps over a 100GbE link, limited by the motherboard’s PCIe x4 slot (a $20 riser handles the x4-to-x16 conversion).
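A rough way to see how latency and bandwidth interact is the usual first-order model: transfer time equals fixed latency plus message size divided by bandwidth. The sketch below uses the ~5µs, ~100µs, and ~50Gbps figures quoted above; the 16 KiB message size is a hypothetical example, and applying the same bandwidth to the TCP case is a simplifying assumption.

```python
def transfer_time_us(message_bytes: int, latency_us: float, bandwidth_gbps: float) -> float:
    """First-order transfer time in microseconds: fixed latency + serialisation time."""
    bits = message_bytes * 8
    # 1 Gbps = 1e9 bits/s = 1e3 bits per microsecond.
    return latency_us + bits / (bandwidth_gbps * 1e3)


MESSAGE_BYTES = 16 * 1024  # hypothetical 16 KiB message (an assumption)

for label, latency_us in [("TCP/IP", 100.0), ("RoCE v2", 5.0)]:
    total = transfer_time_us(MESSAGE_BYTES, latency_us, bandwidth_gbps=50.0)
    print(f"{label}: {total:.2f} us for {MESSAGE_BYTES} bytes")
```

For small messages like this one, the fixed latency term dominates the total, so cutting latency helps far more than adding bandwidth would.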

Two Framework Desktops, No Switch Needed

The guide uses two boards: one with a factory-modified PCIe slot (not recommended) and one with a standard CY PCI-E 4x-to-16x extender. Performance is identical. Each node gets a static IP on a /30 subnet (192.168.100.1 and .2), an MTU of 9000, and firewall trust on the RDMA interface. The DAC cable connects the two E810 cards directly, so a two-node cluster needs no switch.
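A /30 is a natural fit for a point-to-point link because it contains exactly two usable host addresses. The short check below uses Python's standard `ipaddress` module; the network address 192.168.100.0/30 is inferred from the two host addresses given above.

```python
import ipaddress

# The /30 that contains 192.168.100.1 and 192.168.100.2.
link = ipaddress.ip_network("192.168.100.0/30")

# hosts() excludes the network and broadcast addresses.
hosts = [str(host) for host in link.hosts()]
print(hosts)  # ['192.168.100.1', '192.168.100.2']

# Both node addresses fall inside the subnet.
node_a = ipaddress.ip_address("192.168.100.1")
node_b = ipaddress.ip_address("192.168.100.2")
print(node_a in link and node_b in link)  # True
```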

Verified firmware: Intel E810 with version 4.91 0x800214b5 1.3909.0. The refresh_toolbox.sh script automatically detects InfiniBand/RDMA devices and exposes them inside the container.
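How refresh_toolbox.sh detects devices is not described here, so the following is only an illustrative sketch of one common approach on Linux: listing the entries the kernel exposes under `/sys/class/infiniband`. The function name `list_rdma_devices` is hypothetical, and this is not taken from the guide's script.

```python
from pathlib import Path


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return the names of RDMA devices visible in sysfs, or [] if none are present."""
    root = Path(sysfs_root)
    if not root.is_dir():
        # No RDMA-capable driver loaded, or not running on Linux.
        return []
    return sorted(entry.name for entry in root.iterdir())


devices = list_rdma_devices()
if devices:
    print("RDMA devices:", ", ".join(devices))
else:
    print("No RDMA devices found")
```

An empty result is a quick hint that the RDMA driver is not loaded or the NIC was not detected.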

The Software Stack: Fedora, vLLM, Ray, RCCL

vLLM handles inference with Tensor Parallelism. Ray orchestrates the cluster’s control plane. RCCL (AMD’s equivalent of NCCL) drives the data plane over RoCE v2. The guide ships a custom librccl.so patch to ensure the APU’s unified memory is treated as a GPU device for RCCL operations.

Launching a model takes two steps in the TUI: start the Ray cluster, then select “Launch VLLM Serve” and pick your model. Export HF_TOKEN first if the model is gated. The guide lists tested kernel versions (6.18.5 and 6.18.6 on Fedora 43) and warns that older firmware needs the Intel NVM Update Tool.
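The exact command the TUI runs is not shown here. As a hedged sketch, an equivalent manual launch would likely resemble the command built below, using vLLM's `--tensor-parallel-size` and `--distributed-executor-backend` options. The helper `build_vllm_serve_command` and the model name are hypothetical placeholders.

```python
import os
import shlex


def build_vllm_serve_command(model: str, tensor_parallel_size: int = 2) -> list[str]:
    """Build an argv list for `vllm serve` with TP across a Ray cluster."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--distributed-executor-backend", "ray",
    ]


# Placeholder model ID (an assumption); substitute the model you want to serve.
cmd = build_vllm_serve_command("example-org/example-model")
print(shlex.join(cmd))

# Gated Hugging Face models need a token in the environment.
if "HF_TOKEN" not in os.environ:
    print("note: HF_TOKEN is not set; gated models may fail to download")
```

The sketch only prints the command; run it on the head node after the Ray cluster is up, and treat the guide's TUI as the authoritative path.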

For anyone who has two Strix Halo boards and a pair of 100GbE cards, this guide turns them into a single coherent inference server that can run models larger than 128GB, without the cost of a switch or proprietary InfiniBand hardware. The next step is scaling beyond two nodes, but for now the latency numbers show that AMD’s unified memory and RoCE v2 can compete with NVIDIA’s NVLink for distributed inference on a shoestring budget.


Source: AMD Strix Halo RDMA Cluster Setup Guide
Domain: github.com
