
Frontier LLMs Solve Under a Third of Multi-GPU Kernel Problems

ParallelKernelBench throws 87 real-world multi-GPU kernel tasks from production systems at GPT-5.5, Opus 4.7, and others. The best model gets 28 of 87 (about 32%) correct zero-shot, and 31% both correct and faster than the baseline with three samples; a few generated kernels beat all public references.


28 of 87 problems solved correctly, and only 22 of those beat the naive PyTorch + NCCL baseline. That's the scorecard for the best frontier LLM on ParallelKernelBench (PKB), a new multi-GPU kernel generation benchmark from Together AI.

PKB takes real distributed workloads from Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, NeMo-RL, and a long tail of non-LLM code (GNN routing, distributed FFTs, Gaussian splatting) and asks models to replace the PyTorch + NCCL reference with a CUDA kernel that communicates directly over NVLink via symmetric memory.

Why multi-GPU kernel generation is a different beast

Single-GPU kernel benchmarks miss the real bottleneck in production: communication overhead can eat over 20% of inference latency, and that gap widens as compute scales faster than interconnect bandwidth.

Multi-GPU adds a combinatorial design space: tensor, context, data, expert, sequence, and FSDP/ZeRO parallelism each create different communication patterns. The performance model shifts from a compute/memory roofline to an interconnect roofline. And there are two critical new design choices: whether to move data through the copy engine, TMA, SM load/store, or NVLS, and whether to fuse that movement with compute.
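To make the interconnect-roofline idea concrete, here is a minimal back-of-the-envelope sketch. It is not PKB's performance model: the function name `step_time`, the `overlap` parameter, and every hardware number below are hypothetical, chosen only to show why fusing communication with compute matters.

```python
def step_time(flops, bytes_moved, peak_flops, link_bw, overlap=0.0):
    """Estimate the time of one kernel step in seconds.

    overlap = 0.0 means communication runs strictly after compute;
    overlap = 1.0 means the shorter phase is fully hidden behind the longer one.
    All inputs are illustrative assumptions, not measured values.
    """
    compute = flops / peak_flops        # time bound by the math units
    comm = bytes_moved / link_bw        # time bound by the interconnect
    hidden = overlap * min(compute, comm)
    return compute + comm - hidden


# Hypothetical workload and hardware figures (assumptions for illustration).
FLOPS = 2e12          # floating-point operations in the step
BYTES = 4.5e8         # bytes exchanged between GPUs
PEAK_FLOPS = 1e15     # assumed peak throughput, FLOP/s
LINK_BW = 4.5e11      # assumed interconnect bandwidth, bytes/s

serialized = step_time(FLOPS, BYTES, PEAK_FLOPS, LINK_BW, overlap=0.0)
fused = step_time(FLOPS, BYTES, PEAK_FLOPS, LINK_BW, overlap=1.0)
comm_share = (BYTES / LINK_BW) / serialized

print(f"serialized: {serialized * 1e3:.2f} ms")
print(f"fused:      {fused * 1e3:.2f} ms")
print(f"communication share when serialized: {comm_share:.0%}")
```

With full overlap the estimate collapses to the larger of the two phases, which is the roofline view: once compute is hidden or cheap, the interconnect sets the floor.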

PKB's 87 problems cover that taxonomy. The authors built it by identifying every major parallelism type and pulling reference implementations from production codebases. Because the baselines are standard PyTorch + NCCL, the benchmark isn't tied to a specific hardware generation - it evolves naturally with new interconnects.

What the numbers show about model weaknesses

Zero-shot, the best model (GPT-5.5) solves 28 problems, with 22 faster than the baseline. Sampling three attempts per problem (pass@3) raises that to 36 correct and 27 faster than the baseline. That puts the fast-1@3 rate at 27 of 87, or about 31%.
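The counts above can be turned into rates with a few lines of Python. This is a sketch under an assumed definition: a problem counts toward fast-1@k if any of its first k attempts is both correct and more than 1x the baseline speed. The `Attempt` record, the helper names, and the toy data are hypothetical; only the 87-problem totals come from the article.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    correct: bool     # output matched the reference implementation
    speedup: float    # baseline_time / kernel_time (assumed convention)


def correct_at_k(problems, k):
    """Count problems where any of the first k attempts is correct."""
    return sum(any(a.correct for a in attempts[:k]) for attempts in problems)


def fast_at_k(problems, k, threshold=1.0):
    """Count problems where any of the first k attempts is correct
    and faster than threshold x the baseline."""
    return sum(
        any(a.correct and a.speedup > threshold for a in attempts[:k])
        for attempts in problems
    )


# Toy data: four problems, three attempts each (not real PKB results).
problems = [
    [Attempt(True, 1.4), Attempt(False, 0.0), Attempt(True, 0.9)],
    [Attempt(False, 0.0), Attempt(True, 0.8), Attempt(True, 0.7)],
    [Attempt(False, 0.0), Attempt(False, 0.0), Attempt(True, 1.2)],
    [Attempt(False, 0.0), Attempt(False, 0.0), Attempt(False, 0.0)],
]
print(correct_at_k(problems, 1), correct_at_k(problems, 3))
print(fast_at_k(problems, 1), fast_at_k(problems, 3))

# Rates implied by the reported counts for the best model.
TOTAL = 87
reported = {"correct@1": 28, "fast@1": 22, "correct@3": 36, "fast@3": 27}
rates = {name: round(100 * count / TOTAL) for name, count in reported.items()}
print(rates)
```

Note that the headline 31% corresponds to the fast@3 count (27 of 87), while zero-shot correctness alone is 28 of 87, about 32%.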

The breakdown by parallelism type shows where models struggle. Context parallel tasks saw the most success: GPT-5.5 got 7 of 12 correct at pass@1, and 5 of those beat the baseline. Tensor parallel was a disaster: 2 of 17 correct. Pipeline parallel got zero correct across all models. Expert parallel (11 problems) produced 3 correct solutions from GPT-5.5, but none faster than the baseline.

DeepSeek V4 Pro and GLM-5.x performed worse across the board. Gemini 3 Pro matched GPT-5.5 on collective primitives (6 of 8 at pass@3) but fell off on any parallelism type requiring careful synchronization.

A few surprising wins hint at the path forward

Not everything is grim. A handful of generated kernels are faster than anything publicly available. One example targets NVIDIA NeMo-RL's GRPO training loop, which had no prior optimized reference; there, the LLM produced a solution faster than the existing implementation.

This suggests models can sometimes escape local minima that human kernel writers get stuck in. The benchmark doesn't just measure correctness - it measures whether the model can find a better point in the design space that fuses communication with compute or chooses a smarter data movement strategy.

The gap between zero-shot and pass@3 results shows that sampling helps, but the ceiling remains low. The authors don't claim the benchmark is easy; they built it to expose where current models fall short and to drive progress. Next step: models that can reason about interconnect topology and choose between TMA, NVLS, and SM load/store based on problem shape and hardware constraints.


Source: ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)
Domain: together.ai
