
PyTorch LP Solver Delivers 3.86x Multi-GPU Scaling and Order-of-Magnitude Speedups

A distributed linear programming solver that uses column-sharded parallelism and ridge regularization achieves near-linear GPU scaling and beats DuaLip-Scala by an order of magnitude on synthetic workloads.

pytorch, triton, dualip scala, gpu clusters, linear programming, ad allocation

A distributed multi-GPU LP solver built natively in PyTorch with column-sharded parallelism achieves 3.86x scaling on 4 GPUs and an order-of-magnitude wall-clock speedup over DuaLip-Scala on synthetic workloads. Under fixed hardware budgets, the system also scales beyond the memory ceiling of existing GPU solvers such as cuPDLP and D-PDLP.

Production decision systems for ad allocation or content matching regularly solve linear programs with millions of users and thousands of items. The problem structure is sparse and block-diagonal across users. Three system gaps have kept these workloads on slow CPU solvers: memory limits (GPU solvers can’t hold production instances), temporal instability (solution variability across runs causes downstream churn), and rigid interfaces (DuaLip-Scala couples problem formulation to fixed schemas).

Column-Sharded Parallelism with Fused Triton Kernels

The system adopts column-sharded parallelism across GPUs: as the number of users grows, only local computation increases, while communication is limited to a reduction of item-level dual variables. Fused Triton kernels and batched operations cut per-iteration overhead. The result is near-linear scaling: 3.86x on 4 GPUs, or roughly 96.5% parallel efficiency, which existing GPU solvers can’t deliver at these problem sizes.
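
To make the communication pattern concrete, here is a minimal pure-Python sketch, not the paper's implementation. Users are split into shards that stand in for GPUs, each shard computes item-level usage for its own users, and a single sum over item-length vectors stands in for the cross-GPU reduction. The function names, the toy "pick the best item if its reduced value is positive" rule, and the data are all assumptions made for illustration.

```python
# Hypothetical sketch: shards stand in for GPUs, all_reduce_sum stands in for
# a collective reduction. Nothing here is taken from the paper's code.

def local_item_usage(user_block, duals):
    """Per-shard work: each user takes the item with the best reduced value
    (value minus dual price), if that value is positive. Returns usage per item."""
    usage = [0.0] * len(duals)
    for values in user_block:
        reduced = [v - lam for v, lam in zip(values, duals)]
        best = max(range(len(reduced)), key=reduced.__getitem__)
        if reduced[best] > 0:
            usage[best] += 1.0
    return usage

def all_reduce_sum(partials):
    """Sum item-length vectors across shards. The payload size depends on the
    number of items, not on the number of users."""
    return [sum(column) for column in zip(*partials)]

def shard(users, num_shards):
    """Round-robin split of users (the columns of the LP) across shards."""
    return [users[i::num_shards] for i in range(num_shards)]

# Toy data: 6 users x 3 items, with made-up values and dual prices.
users = [
    [3.0, 1.0, 0.5],
    [2.0, 2.5, 0.1],
    [0.2, 0.3, 0.1],
    [1.0, 4.0, 2.0],
    [5.0, 0.5, 4.9],
    [0.9, 0.8, 3.0],
]
duals = [1.0, 1.0, 1.0]

single = local_item_usage(users, duals)
sharded = all_reduce_sum([local_item_usage(block, duals) for block in shard(users, 4)])
print(single, sharded)
```

Both paths give the same item usage. Adding users only lengthens the per-shard loop, while the vector that crosses shard boundaries stays as long as the item list.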

Ridge Regularization for Temporal Stability

Solution variability across runs is a real headache for SLAs. The paper introduces ridge-regularized LPs, a control missing from current GPU solvers. A continuation schedule over the regularization parameter trades convergence speed for solution fidelity, giving engineers explicit control over stability without sacrificing accuracy.
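
A toy example shows why a ridge term damps churn. It assumes a single user who must spread one unit of allocation over three items (a simplex constraint) and a quadratic penalty (gamma / 2) * ||x||^2 subtracted from the linear objective. Under that assumption the regularized optimum is the Euclidean projection of values / gamma onto the simplex. The paper's exact formulation and schedule may differ, and the gamma values below are invented.

```python
# Hypothetical sketch: plain LP vs. ridge-regularized LP for one user.

def project_simplex(v):
    """Euclidean projection of v onto {x >= 0, sum(x) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for k, uk in enumerate(u, start=1):
        cumsum += uk
        t = (cumsum - 1.0) / k
        if uk - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]

def lp_solution(values):
    """Unregularized LP over the simplex: all mass on the best item (a vertex)."""
    best = max(range(len(values)), key=values.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(values))]

def ridge_solution(values, gamma):
    """Maximizer of values.x - (gamma / 2) * ||x||^2 over the simplex."""
    return project_simplex([c / gamma for c in values])

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Two "runs" whose item values differ by a tiny perturbation.
run_a = [1.00, 0.99, 0.20]
run_b = [0.99, 1.00, 0.20]

lp_churn = l1_distance(lp_solution(run_a), lp_solution(run_b))
ridge_churn = l1_distance(ridge_solution(run_a, 1.0), ridge_solution(run_b, 1.0))

# Assumed continuation schedule: shrink gamma to approach the LP vertex.
schedule = [1.0, 0.1, 0.01]
path = [ridge_solution(run_a, gamma) for gamma in schedule]

print(lp_churn, ridge_churn)
print(path)
```

The unregularized solution jumps from one vertex to another when the top two values swap, while the regularized solution barely moves. Shrinking gamma along the schedule pulls the regularized solution back toward the LP vertex, which is the stability-versus-fidelity dial the paragraph above describes.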

Operator-Centric Programming Model

DuaLip-Scala’s schema-bound interface makes adding new constraint families painful. The new system replaces that with composable operator primitives. You can express new formulations without touching the solve loop or distributed infrastructure. That’s extensibility by design, not by hack.
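
What "composable operator primitives" could look like is easiest to see in a sketch. Everything below is hypothetical: the LinearOperator class, the stack helper, and the dual_step update are illustrative names, not the system's API. The point is only that the solver-side function touches a constraint family through forward and bounds, so a new family can be stacked on without editing it.

```python
# Hypothetical sketch of an operator-style interface; not the paper's API.
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]

@dataclass
class LinearOperator:
    """A constraint family A x <= b, exposed as a forward map plus bounds."""
    forward: Callable[[Sequence[float]], Vector]  # x -> A x
    bounds: Vector                                # b

    def violation(self, x: Sequence[float]) -> Vector:
        return [max(ax - b, 0.0) for ax, b in zip(self.forward(x), self.bounds)]

def stack(*ops: LinearOperator) -> LinearOperator:
    """Compose constraint families by concatenating their rows."""
    def forward(x: Sequence[float]) -> Vector:
        out: Vector = []
        for op in ops:
            out.extend(op.forward(x))
        return out
    return LinearOperator(forward, [b for op in ops for b in op.bounds])

def dual_step(op: LinearOperator, x: Sequence[float], duals: Vector, step: float) -> Vector:
    """Solver-side update that only sees the operator interface:
    projected subgradient step, lambda <- max(lambda + step * (A x - b), 0)."""
    return [max(lam + step * (ax - b), 0.0)
            for lam, ax, b in zip(duals, op.forward(x), op.bounds)]

# x is a flat allocation for 2 users x 2 items: [x00, x01, x10, x11].
item_capacity = LinearOperator(lambda x: [x[0] + x[2], x[1] + x[3]], [1.0, 1.0])

# A new family (a global budget with made-up costs) added without touching dual_step.
costs = [2.0, 3.0, 1.0, 4.0]
budget = LinearOperator(lambda x: [sum(c * xi for c, xi in zip(costs, x))], [5.0])

combined = stack(item_capacity, budget)
x = [1.0, 0.0, 1.0, 0.0]
new_duals = dual_step(combined, x, [0.0, 0.0, 0.0], 0.5)
print(combined.forward(x), combined.violation(x), new_duals)
```

Here both users pick item 0, so only the first capacity row is violated and only its dual price rises. The budget family was added by writing one operator and calling stack, with dual_step left unchanged.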

For teams running repeated large-scale matching problems, this system lifts the memory bottleneck, stabilizes solutions, and speeds up runs by an order of magnitude on the reported synthetic workloads, which makes it a strong candidate for pulling production LP workflows off CPUs and onto GPU clusters.


Source: Large-Scale Regularized Matching on GPU Clusters
Domain: arxiv.org
