A random forest trained on just 54 molecular dynamics benchmark runs predicts loop time with 4.0% relative error and can rank optimal MPI+OpenMP configurations — but only when the target stays within the same architectural regime.
What 54 Runs Reveal Across 1–8 Nodes
Every HPC allocation manager knows the pain: exhaustively benchmarking hybrid MPI+OpenMP configurations eats budget proportional to the grid size. A study out of CENAPAD-SP (Lovelace cluster, AMD EPYC 7662) asked whether a cold-start random forest, trained once on a structured 54-run dataset, could replace those runs for future recommendations.
The dataset consists of 54 LAMMPS+SPICA runs of the antimicrobial peptide Tritrpticin on a hydrated DOPC bilayer (4,354 coarse-grained beads). It spans 18 hybrid configurations across 1–8 nodes, with three replications each (18 × 3 = 54). Nine topology and resource features feed five regressors: one for loop time and one for each of four internal LAMMPS timing fractions (Pair, Kspace, Comm, Modify).
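To make the feature side concrete, here is a minimal Python sketch of how a hybrid configuration might be turned into topology and resource features, and how 18 configurations with three replications give 54 runs. The class name, field names, and the particular derived features are assumptions for illustration; the study's exact nine features are not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridConfig:
    nodes: int          # number of compute nodes
    mpi_per_node: int   # MPI ranks launched on each node
    omp_threads: int    # OpenMP threads per MPI rank


def features(cfg: HybridConfig) -> dict:
    """Derive illustrative features (hypothetical names, not the study's set)."""
    total_ranks = cfg.nodes * cfg.mpi_per_node
    total_cores = total_ranks * cfg.omp_threads
    return {
        "nodes": cfg.nodes,
        "mpi_ranks": total_ranks,
        "omp_threads": cfg.omp_threads,
        "total_cores": total_cores,
        "mpi_omp_ratio": total_ranks / cfg.omp_threads,
        "multi_node": int(cfg.nodes > 1),
    }


def replicate(configs, reps=3):
    """Pair every configuration with a replication index."""
    return [(cfg, rep) for cfg in configs for rep in range(reps)]
```

Keeping the MPI/OpenMP ratio as its own column, rather than leaving the model to infer it from rank and thread counts, is the kind of explicit feature engineering the article credits for the result.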
In-sample, the mean absolute error on loop time is 0.49 seconds, or 4.0% relative. Because this is measured on the training data, it is an optimistic figure rather than an estimate of error on unseen configurations. Feature importance tells a sharper story: the predictive signal lives almost entirely in the OpenMP thread count and the MPI/OpenMP ratio, while raw node count and core count contribute under 3%. So the model isn't learning "more nodes = faster"; it's learning how the work is split between MPI ranks and OpenMP threads.
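For readers who want to reproduce this kind of figure on their own timings, the two error measures can be sketched as follows. The normalization used here (MAE divided by the mean observed loop time) is an assumption; the summary does not say which definition of relative error the study uses.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_mae(actual, predicted):
    """MAE divided by the mean observed value (one common convention)."""
    mean_actual = sum(actual) / len(actual)
    return mean_absolute_error(actual, predicted) / mean_actual


# Hypothetical loop times in seconds, not data from the study.
observed = [10.0, 12.0, 14.0]
predicted = [10.5, 11.5, 14.5]
print(mean_absolute_error(observed, predicted))
print(relative_mae(observed, predicted))
```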
Where the Surrogate's Recommendations Fail (and Why)
Generalization is the real test. Leave-one-dimension-out experiments reveal that accuracy is governed by hardware-regime membership. Within a single-node, multi-node, or shared-threading tier, the surrogate ranks configurations correctly. Cross a regime boundary, for example by asking a model fitted on single-node runs to predict a multi-node target, and prediction quality collapses.
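The shape of such an experiment can be sketched in a few lines of Python. The splitter below holds out every run that shares one value of a chosen dimension, and the agreement score counts how many pairs of configurations the model orders correctly. Both helpers and the `regime` key are illustrative assumptions, not the study's code or its ranking metric.

```python
from collections import defaultdict


def leave_one_group_out(rows, key):
    """Yield (held_out_value, train_rows, test_rows) for each value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for held_out, test_rows in groups.items():
        train_rows = [r for g, rs in groups.items() if g != held_out for r in rs]
        yield held_out, train_rows, test_rows


def pairwise_rank_agreement(actual, predicted):
    """Fraction of pairs ordered the same way by observed and predicted times."""
    agree = total = 0
    for i in range(len(actual)):
        for j in range(i + 1, len(actual)):
            da = actual[i] - actual[j]
            if da == 0:
                continue  # ties in observed time carry no ordering information
            total += 1
            if da * (predicted[i] - predicted[j]) > 0:
                agree += 1
    return agree / total if total else float("nan")
```

A score near 1.0 inside a regime and a much lower score when the held-out group is a different regime would be the pattern the article describes.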
This isn't a bug; it's a map. The surrogate provides an interpretable trust boundary: use it to scope new benchmark campaigns within a known regime, not to extrapolate across architectures. The result is an explicit cost-benefit window where you can skip 80–90% of brute-force runs and still identify high-performing configs.
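One way to turn that trust boundary into a guard is to refuse predictions for targets whose regime never appeared in training. The sketch below is a deliberate simplification: it labels regimes by node count alone and ignores the shared-threading tier, whose definition is not given here. The function names are hypothetical.

```python
def regime_of(nodes: int) -> str:
    """Hypothetical two-tier labelling based on node count only."""
    return "single-node" if nodes == 1 else "multi-node"


def within_trust_boundary(target_nodes: int, trained_nodes) -> bool:
    """True when the target's regime was represented in the training runs."""
    seen = {regime_of(n) for n in trained_nodes}
    return regime_of(target_nodes) in seen


# Trained on 2- and 8-node runs: a 4-node target stays inside the boundary,
# while a single-node target falls outside it.
print(within_trust_boundary(4, [2, 8]))
print(within_trust_boundary(1, [2, 8]))
```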
Practical Takeaways for HPC Benchmarking
No, 54 runs won't replace your next full-scale production tuning. But this work shows that a small, carefully structured benchmark dataset — with explicit replication and feature engineering — can yield a surrogate that knows its own limits. The Lovelace cluster results give a concrete template: collect a grid of hybrid configs on a few nodes, train a random forest, extract the feature importances and regime map, then run targeted verification instead of a full factorial sweep.
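The targeted-verification step at the end of that template can be sketched as a shortlist: rerun only the configurations the surrogate predicts to be fastest, and count what fraction of the grid is skipped. The function name, the configuration labels, and the choice of k are assumptions for illustration.

```python
def verification_plan(predicted_times: dict, k: int):
    """Return the k configs with the lowest predicted loop time and the skipped fraction."""
    ranked = sorted(predicted_times, key=predicted_times.get)
    shortlist = ranked[:k]
    skipped = 1 - len(shortlist) / len(predicted_times)
    return shortlist, skipped


# Hypothetical predictions for an 18-configuration grid (seconds).
predictions = {f"cfg{i}": 10.0 + i for i in range(18)}
shortlist, skipped = verification_plan(predictions, k=3)
print(shortlist, round(skipped, 3))
```

Under these assumptions, verifying two or three configurations out of 18 skips roughly 83–89% of the grid, which is in line with the 80–90% window quoted above, though the study may arrive at that window differently.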
Next step that matters: testing this methodology on heterogeneous hardware (GPU-accelerated nodes, different interconnects) and larger configuration spaces with more than 18 combinations. The interpretability approach scales; the 54-run budget might not, but the pattern is transferable.
Source: How far does a random forest generalize from a 54-run LAMMPS+SPICA benchmark?
Domain: arxiv.org