
cuTile Rust Hits 92% of B200 Peak with Zero Safety Overhead

NVlabs' safe Rust DSL for tile-based GPU kernels delivers 7 TB/s of memory bandwidth and 2 PFlop/s GEMM, within 0.3% of the low-level Tile IR baseline, while ruling out data races and use-after-free.

Tags: nvlabs, cutile rust, rust, gpu programming, cuda, memory safety

NVlabs released cuTile Rust, a tile-based kernel DSL that gives you Rust's full ownership discipline on the GPU. On an NVIDIA B200, it hits 7 TB/s for element-wise ops and 2 PFlop/s for GEMM, which is 92% of the dense f16 peak. And here's the punchline: the safe Rust version runs within 0.3% of the low-level Tile IR baseline.
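To put the GEMM figure in perspective, here is a back-of-the-envelope sketch. It assumes the conventional count of 2*M*N*K floating-point operations for a dense GEMM, the M=N=K=8192 problem size and 2.07 PFlop/s rate reported for the persistent GEMM benchmark, and the article's rounded 92% figure. The function names are hypothetical and not part of cuTile Rust, and the results are only as precise as those rounded inputs.

```rust
/// FLOPs in a dense M x N x K GEMM, counting one multiply and one add
/// per inner-product term (the usual 2*M*N*K convention).
fn gemm_flops(m: u64, n: u64, k: u64) -> f64 {
    2.0 * m as f64 * n as f64 * k as f64
}

/// Peak throughput implied by an achieved rate and the fraction of peak
/// that rate is said to represent.
fn implied_peak(achieved_flops_per_s: f64, fraction_of_peak: f64) -> f64 {
    achieved_flops_per_s / fraction_of_peak
}

fn main() {
    // Inputs taken from the article; both are rounded figures.
    let flops = gemm_flops(8192, 8192, 8192);
    let achieved = 2.07e15; // FLOP/s
    let seconds = flops / achieved;

    println!("FLOPs per GEMM: {:.4e}", flops);
    println!("time per GEMM:  {:.3} ms", seconds * 1e3);
    println!(
        "implied peak:   {:.2} PFlop/s",
        implied_peak(achieved, 0.92) / 1e15
    );
}
```

Under these assumptions a single 8192-cubed GEMM is roughly 1.1 TFLOP of work and finishes in about half a millisecond, and the implied dense f16 peak is about 2.25 PFlop/s.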

Safe Concurrency That Doesn't Cost a Cycle

cuTile Rust leverages the same borrow checker you know from CPU code. An attribute macro captures the kernel's Rust AST, which is then JIT-compiled through CUDA Tile IR into a cubin. Mutable tensors get partitioned into disjoint chunks at launch time, so two tiles never write to overlapping memory. Immutable tensors are shared read-only. The compiler enforces all of it: no data races, no use-after-free, no silent GPU crashes.

Safety overhead? The paper's microbenchmarks measured persistent GEMM at M=N=K=8192. Both the safe Rust path and raw Tile IR round to 2.07 PFlop/s, about 92% of the B200's dense f16 peak, with the safe version landing 0.3% lower. You can have your safety and eat it too.
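The disjoint-partition rule has a direct CPU analogue in standard Rust, which may help build intuition. The sketch below is not cuTile Rust code and uses none of its API. It relies only on the standard library's `chunks_mut` and scoped threads, and the helper name `add_in_tiles` is hypothetical. The mutable output is split into non-overlapping chunks, one per worker, while the read-only input is shared freely.

```rust
use std::thread;

/// Hypothetical helper: adds `rhs` into `out` in place, one worker per tile.
/// A CPU analogy for per-tile partitioning, not the cuTile Rust API.
fn add_in_tiles(out: &mut [f32], rhs: &[f32], tile: usize) {
    assert_eq!(out.len(), rhs.len());
    assert!(tile > 0);

    thread::scope(|s| {
        // `chunks_mut` hands out non-overlapping `&mut [f32]` slices, so each
        // worker has exclusive access to its own tile of the output.
        // `chunks` hands out shared `&[f32]` slices of the read-only input.
        for (out_tile, rhs_tile) in out.chunks_mut(tile).zip(rhs.chunks(tile)) {
            s.spawn(move || {
                for (o, r) in out_tile.iter_mut().zip(rhs_tile) {
                    *o += *r;
                }
            });
        }
        // All workers are joined when the scope ends, so no borrow outlives
        // the data it points to.
    });
}

fn main() {
    let mut out = vec![1.0_f32; 10];
    let rhs: Vec<f32> = (0..10).map(|i| i as f32).collect();
    add_in_tiles(&mut out, &rhs, 4);
    println!("{:?}", out);
}
```

Trying to give two workers the same mutable slice in this sketch is rejected at compile time, which is the kind of guarantee the article describes for GPU tiles.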

Real Inference: 171 Tokens/s for Qwen3 on an RTX 5090

The paper also evaluates Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 decode, Grout pushes 171 tokens/s for Qwen3-4B on an NVIDIA GeForce RTX 5090. On B200, the 32B variant hits 82 tokens/s. Those numbers compete with hand-optimized CUDA inference stacks, yet the engine is built entirely in safe Rust with the same ownership guarantees.

What This Means for the Next Wave of GPU Programming

cuTile Rust proves that a high-level, safe DSL can match bare-metal performance on the most demanding GPU primitive (GEMM). The next step is obvious: expand the tile library, nail down the API for production use, and let the Rust ecosystem write GPU kernels the same way it writes CPU code, with confidence that the compiler won't let you corrupt memory or introduce a race condition. NVlabs has open-sourced the full stack, benchmarks, and reproducibility artifacts. Go run the examples yourself.


Source: Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust
Domain: github.com
