OpenMP على AMD MI250X يضغط 47 مرة على OpenACC على NVIDIA A100

47x. That’s the slowdown a production magnetohydrodynamics code suffered when ported from OpenACC to OpenMP and run on AMD MI250X versus the NVIDIA A100 baseline. The paper, evaluating directive-based GPU programming by porting gPLUTO onto two exascale-class systems, makes one thing clear: portable performance across GPU architectures is a pipe dream without deep compiler work and architecture-aware tuning.

The Numbers: 3x Application, 10x Kernels, 47x Nightmares

gPLUTO — a real astrophysical simulation code, not a synthetic benchmark — runs OpenACC on NVIDIA A100 (Leonardo Booster) at a baseline that sets the bar. Porting it to OpenMP and running on the same A100 hardware shows comparable performance, thanks to a shared compiler backend. But on AMD MI250X (LUMI-G), the same OpenMP implementation is roughly 3x slower at the application level. Kernel-level profiling reveals individual kernels hitting up to an order of magnitude slower than the A100 OpenACC baseline. In low-parallelism kernels, C++ abstraction layers inflate register pressure and spilling, causing extreme slowdowns of 47x in specific cases.

Memory Latency, Not Bandwidth, Is the Real Bottleneck

Kernel-level profiling pinpoints the dominant runtime contributors as memory-latency-bound, not limited by peak bandwidth. Strided memory-access patterns and compiler limitations on AMD’s ROCm stack amplify the gap. The authors attribute the slowdown to sensitivity in how OpenMP handles these patterns versus OpenACC, plus immature compiler optimizations for AMD GPUs. This isn’t about theoretical peak flops — it’s about whether the compiler can schedule loads and stores efficiently enough to hide latency. Right now, it can’t.

What This Means for Portability Promises

Directive-based models (OpenACC, OpenMP) were sold as a way to write once and run anywhere. This study shows the reality: a single OpenMP implementation performs dramatically differently across NVIDIA and AMD hardware. Even the same source code on the same GPU architecture (NVIDIA A100) but using different directives (OpenACC vs OpenMP) can diverge if compiler backends differ. The takeaway isn’t that directive models are useless — it’s that you cannot pretend they abstract away architecture details. Achieving portable performance requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies. Until then, expect 47x surprises when you switch targets.

Source: On the Limits of Performance Portability in Directive-Based GPU Programming
Domain: arxiv.org

OpenMP على AMD MI250X يضغط 47 مرة على OpenACC على NVIDIA A100

The Numbers: 3x Application, 10x Kernels, 47x Nightmares

Memory Latency, Not Bandwidth, Is the Real Bottleneck

What This Means for Portability Promises

More in Systems Engineering