RATrain Tames MT-3000's 20GB DDR Limit to Train LLMs at 112K Tokens/s

RATrain trains LLaMA-2-7B at 112,790.55 tokens/s across 1024 compute clusters on the MT-3000 supercomputing platform, with 97.0% scaling efficiency.

Here's the problem: the MT-3000 has an explicit memory hierarchy with only 20GB of usable DDR per compute cluster. No high-bandwidth interconnects. No mature collective communication libraries. Existing GPU-oriented runtimes assume fast device memory and high-speed links, so they choke on this hardware. The authors of RATrain didn't just port existing code; they rethought the training loop from scratch.

1F1B Training as a Scheduling Problem

RATrain reframes standard non-interleaved 1F1B (one-forward-one-backward) pipeline training as a training-state lifecycle scheduling problem. It schedules gradient synchronization, parameter updates, parameter-view prefetching, and activation recovery at both layer-level and stage-local granularity. Every unit of work therefore accounts explicitly for when its data arrives and where that data lives. No assumptions about instant access.
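For readers who have not met 1F1B before, here is a minimal sketch of the textbook non-interleaved schedule for a single pipeline stage. It is not RATrain's scheduler: the function names `one_f_one_b_schedule` and `peak_in_flight` are made up for this illustration, and the sketch only shows the forward/backward slot order that lifecycle events such as prefetching or activation recovery would have to be placed around.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # ("F" or "B", microbatch index)


def one_f_one_b_schedule(stage: int, num_stages: int, num_microbatches: int) -> List[Op]:
    """Ordered forward/backward slots for one stage (hypothetical helper).

    Textbook non-interleaved 1F1B: a warmup of forwards, a steady phase that
    alternates one forward with one backward, then a cooldown of backwards.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    # Earlier stages need more forwards in flight before the first backward arrives.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup

    ops: List[Op] = []
    fwd = bwd = 0
    for _ in range(warmup):
        ops.append(("F", fwd))
        fwd += 1
    for _ in range(steady):
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1
    for _ in range(warmup):
        ops.append(("B", bwd))
        bwd += 1
    return ops


def peak_in_flight(ops: List[Op]) -> int:
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    for s in range(4):
        sched = one_f_one_b_schedule(s, num_stages=4, num_microbatches=8)
        print(s, peak_in_flight(sched), sched[:6])
```

In this schedule the first stage holds the most microbatches in flight and the last stage holds only one, which is one reason a memory-capped platform cares about when each piece of training state is resident.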

The runtime couples this scheduler with an MT-3000-aware execution backend that handles FP16 GEMM, Attention Backward, and explicit data movement in a predictable way. A resource-aware planner then selects feasible training configurations under the 20GB DDR cap per cluster. That planner is the difference between crashing and scaling.
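To make the idea of a resource-aware planner concrete, here is a rough sketch of a feasibility filter under a per-cluster memory cap. Everything in it is an assumption for illustration, not RATrain's planner: the `Plan` class, the `feasible_plans` function, the candidate parallelism degrees, the 30% reserve for activations and buffers, reading "20GB" as 20 GiB, and the common 16-bytes-per-parameter estimate for mixed-precision Adam training (FP16 weights and gradients plus FP32 master weights, momentum, and variance).

```python
from dataclasses import dataclass
from itertools import product
from typing import List

DDR_BUDGET_BYTES = 20 * 1024**3  # article's 20GB cap, read as GiB (assumption)
BYTES_PER_PARAM = 16             # 2 + 2 + 4 + 4 + 4 bytes, mixed-precision Adam (assumption)


@dataclass(frozen=True)
class Plan:
    """Hypothetical training configuration: parallel degrees and state size."""
    tensor_parallel: int
    pipeline_parallel: int
    state_bytes_per_cluster: float


def feasible_plans(num_params: float, num_clusters: int,
                   reserve_fraction: float = 0.3) -> List[Plan]:
    """List (TP, PP) splits whose model state fits under the DDR budget.

    reserve_fraction sets aside part of DDR for activations and buffers;
    a real planner would model those explicitly instead of guessing.
    """
    budget = DDR_BUDGET_BYTES * (1.0 - reserve_fraction)
    plans: List[Plan] = []
    for tp, pp in product([1, 2, 4, 8], [1, 2, 4, 8, 16, 32]):
        if num_clusters % (tp * pp) != 0:
            continue  # the split must tile the machine evenly
        state = num_params * BYTES_PER_PARAM / (tp * pp)
        if state <= budget:
            plans.append(Plan(tp, pp, state))
    # Prefer the least model parallelism, which leaves more data-parallel replicas.
    return sorted(plans, key=lambda p: p.tensor_parallel * p.pipeline_parallel)


if __name__ == "__main__":
    for plan in feasible_plans(num_params=7e9, num_clusters=1024)[:3]:
        print(plan)
```

Under these assumptions a nominal 7B-parameter model needs its state split at least eight ways before it fits, which illustrates why configuration selection matters so much with 20GB per cluster.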

Measured Gains: 1.35x and 97% Scaling Efficiency

The team evaluated RATrain on a real MT-3000 system using four model configurations: LLaMA-2-7B, Baichuan2-13B, Qwen2.5-32B, and LLaMA-2-70B. Compared to a naive MT-3000-adapted GPU-style strategy, RATrain delivers up to 1.35x end-to-end speedup.

Scaling LLaMA-2-7B to 1024 clusters yields the 112,790.55 tokens/s throughput at 97.0% scaling efficiency. That's not a simulation; it was measured on real silicon. To verify correctness, the authors ran a 1.028B-token training run and measured a maximum relative loss deviation of 0.081% against a semantically equivalent Baseline-1F1B run. The loss trajectory stays intact.
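The two metrics quoted here can be written down in a few lines. The sketch below uses the usual definitions, which the article does not spell out, so treat them as assumptions: scaling efficiency as measured throughput divided by ideal linear scaling from a smaller baseline run, and maximum relative loss deviation as the largest per-step gap relative to the reference loss. The function names and the loss values in the usage lines are invented for illustration and are not data from the paper.

```python
from typing import Sequence


def scaling_efficiency(throughput: float, clusters: int,
                       base_throughput: float, base_clusters: int) -> float:
    """Measured throughput over ideal linear scaling from a baseline run."""
    ideal = base_throughput * clusters / base_clusters
    return throughput / ideal


def max_relative_deviation(candidate: Sequence[float],
                           reference: Sequence[float]) -> float:
    """Largest |candidate - reference| / |reference| across logged steps."""
    if len(candidate) != len(reference):
        raise ValueError("loss curves must have the same number of points")
    return max(abs(c - r) / abs(r) for c, r in zip(candidate, reference))


if __name__ == "__main__":
    # Per-cluster throughput implied by the reported aggregate figure.
    print(round(112_790.55 / 1024, 2), "tokens/s per cluster")

    # Made-up loss curves, only to show the call shape.
    ref = [2.50, 2.10, 1.90]
    run = [2.501, 2.099, 1.901]
    print(f"{max_relative_deviation(run, ref):.4%}")
```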

What This Enables Next

Bandwidth-constrained heterogeneous supercomputers aren't going away; they're too cost-effective for large-scale deployments. RATrain shows you don't need fat interconnects or unlimited HBM to train dense LLMs at scale. Expect this scheduling framework to influence how other exotic hardware platforms (think CPUs with accelerators or CXL-attached memory) get production training runtimes.

Source: RATrain: A Resource-Aware Training Runtime for Large Language Models on Bandwidth-Constrained Heterogeneous Supercomputing Platforms
Domain: arxiv.org
