
MPK Compiler Cuts Multi-GPU LLM Inference Latency by Up to 1.7x

Mirage Persistent Kernel fuses tensor program kernels into one mega-kernel using SM-level task graphs and decentralized scheduling, pushing LLM serving closer to hardware limits.

Tags: mirage, mpk, multi gpu inference, kernel fusion, llm serving, cuda

Up to 1.7x lower end-to-end inference latency on multi-GPU LLM serving – that’s what the Mirage Persistent Kernel (MPK) compiler and runtime deliver, and they do it by throwing out the standard kernel-per-operator execution model.

I have seen countless papers propose kernel fusion, but most stop at fusing operators within a single GPU or require hand-tuned kernels. MPK works at a finer granularity: it builds an SM-level graph representation that captures data dependencies between tasks assigned to individual streaming multiprocessors (SMs). That lets it pipeline across operators and overlap computation with communication at a fine grain – optimizations that are structurally impossible when each operator runs as its own separately launched kernel.

SM-Level Task Graphs Replace Kernel-Per-Operator

The core insight is that the GPU’s own parallelism can be used to schedule work across SMs, rather than treating each kernel launch as a black box. MPK’s compiler takes tensor programs written in existing programming models and lowers them into optimized SM-level task graphs. For each task, it generates a fast CUDA implementation. The runtime then executes all of those tasks within a single persistent mega-kernel that does not return to the host until the entire inference pass is done.
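To make the difference concrete, here is a toy Python model, not MPK code. It assumes two chained operators with the hypothetical names "matmul" and "allreduce", each split into unit-cost per-tile tasks, and a handful of workers standing in for SMs. The function names build_task_graph and makespan are made up for this sketch. It compares a schedule with a barrier between operators against one that only honors tile-level dependencies.

```python
def build_task_graph(num_tiles):
    """Map each task to its prerequisites for two chained operators.

    Assumption for illustration: allreduce tile t needs only matmul tile t,
    not the whole matmul operator.
    """
    deps = {("matmul", t): [] for t in range(num_tiles)}
    for t in range(num_tiles):
        deps[("allreduce", t)] = [("matmul", t)]
    return deps


def makespan(deps, num_workers, operator_barrier=False):
    """Count time steps needed to run all unit-cost tasks on num_workers workers."""
    done = set()
    pending = list(deps)  # insertion order: all matmul tiles, then allreduce tiles
    steps = 0
    while pending:
        ready = [t for t in pending if all(d in done for d in deps[t])]
        if operator_barrier:
            # Kernel-per-operator model: only the earliest unfinished operator may run.
            current_op = pending[0][0]
            ready = [t for t in ready if t[0] == current_op]
        running = ready[:num_workers]
        if not running:
            raise ValueError("dependency cycle: no task is ready")
        for t in running:
            pending.remove(t)
        done.update(running)
        steps += 1
    return steps


graph = build_task_graph(num_tiles=3)
print(makespan(graph, num_workers=2, operator_barrier=True))   # 4
print(makespan(graph, num_workers=2, operator_barrier=False))  # 3
```

With three tiles and two workers, the barrier version leaves one worker idle while the last matmul tile finishes. The tile-level version fills that slot with the first allreduce tile, which is the kind of cross-operator pipelining the SM-level graph is meant to expose.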

Decentralized Scheduling Across SMs

No centralized scheduler – the MPK runtime uses decentralized scheduling across SMs. Each SM picks up ready tasks from its local queue, so the mega-kernel keeps all SMs busy with useful work instead of idling at kernel launch boundaries. This is how MPK eliminates the overhead of repeated kernel launches and global synchronization that plagues the conventional approach.
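A rough way to picture this is the sequential Python simulation below. It is a sketch under assumptions, not the MPK runtime: the function name run_decentralized, the per-task dependency counters, and the round-robin rule for placing newly ready tasks are all invented for illustration. The one property it tries to show is that no central component hands out work. Whichever worker finishes a task updates its successors and enqueues the ones that become ready.

```python
from collections import deque


def run_decentralized(deps, num_workers):
    """Simulate workers ("SMs") that each pull tasks from their own queue.

    deps maps task -> list of prerequisite tasks.
    Returns the execution order as (worker_id, task) pairs.
    """
    remaining = {t: len(p) for t, p in deps.items()}  # unmet prerequisites per task
    successors = {t: [] for t in deps}
    for task, prereqs in deps.items():
        for p in prereqs:
            successors[p].append(task)

    queues = [deque() for _ in range(num_workers)]
    # Seed the local queues with tasks that have no prerequisites.
    initially_ready = [t for t, n in remaining.items() if n == 0]
    for i, t in enumerate(initially_ready):
        queues[i % num_workers].append(t)

    order = []
    while len(order) < len(deps):
        progressed = False
        for w in range(num_workers):
            if not queues[w]:
                continue  # nothing in this worker's local queue right now
            task = queues[w].popleft()
            order.append((w, task))
            progressed = True
            # The finishing worker itself releases successors (no central scheduler).
            for k, succ in enumerate(successors[task]):
                remaining[succ] -= 1
                if remaining[succ] == 0:
                    # Assumed placement rule: spread ready tasks round-robin.
                    queues[(w + k) % num_workers].append(succ)
        if not progressed:
            raise ValueError("dependency cycle: no task is ready")
    return order


diamond = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
for worker, task in run_decentralized(diamond, num_workers=2):
    print(worker, task)
```

On a real GPU the counters would need atomic updates and the queues would live in device memory. The simulation skips all of that and only shows the control flow.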

Evaluation: LLM Inference Close to Hardware Limits

The authors benchmarked MPK against existing kernel-per-operator LLM serving systems. Across multiple models and GPU configurations, MPK achieved up to 1.7x lower end-to-end latency. More importantly, the paper shows performance that approaches the theoretical limits of the underlying hardware – something that hand-tuned kernels often miss because they cannot adapt to runtime conditions.
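For a feel of what a hardware limit can mean here, one common back-of-the-envelope estimate for token-by-token decoding is the time needed to stream the model weights from GPU memory once. The Python sketch below encodes that estimate. It is not the paper's methodology, the function names are made up, and the numbers in the usage lines are hypothetical placeholders rather than results from the evaluation.

```python
def decode_latency_floor_ms(weight_bytes, num_gpus, mem_bandwidth_bytes_per_s):
    """Rough per-token latency floor in milliseconds.

    Assumptions: weights are sharded evenly across GPUs, every weight byte is
    read exactly once per token, and KV-cache reads, communication, and
    compute time are ignored.
    """
    bytes_per_gpu = weight_bytes / num_gpus
    return bytes_per_gpu / mem_bandwidth_bytes_per_s * 1e3


def fraction_of_floor(measured_ms, floor_ms):
    """How close a measured latency is to the floor (1.0 means at the floor)."""
    return floor_ms / measured_ms


# Hypothetical example: 16 GB of weights, 4 GPUs, 2 TB/s memory bandwidth per GPU.
floor = decode_latency_floor_ms(16e9, num_gpus=4, mem_bandwidth_bytes_per_s=2e12)
print(floor)
print(fraction_of_floor(measured_ms=2.5, floor_ms=floor))
```

Kernel launch gaps and global synchronization add latency on top of such a floor, which is why removing them can move a serving system closer to it.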

MPK is open source at https://github.com/mirage-project/mirage, so anyone can reproduce these results or adapt the compiler to their own multi-GPU inference pipelines. This is the kind of systems work that makes you rethink what a GPU runtime should look like.


Source: MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs
Domain: arxiv.org
