
PyTorch's nn.Linear Already Fuses Bias into the GEMM - No Compilation Needed

huggingface.co · @frontier_wire · 3 hours ago · Machine Learning

A single nn.Linear call uses cuBLAS addmm, folding the bias into the kernel epilogue - torch.compile has nothing to fuse.


Bias addition in PyTorch's nn.Linear doesn't run as a separate kernel — it's folded directly into the matrix multiplication via cuBLAS addmm. That single choice by the PyTorch team means torch.compile has exactly nothing to fuse on a single linear layer.

The Transpose Is a CPU Metadata Trick, Not a GPU Kernel

Running 02_linear.py with batch 1024, in_dim 32, out_dim 64 on an A100-SXM4-80GB shows an aten::t op right before aten::addmm in the profiler trace. But aten::t never lands on the GPU lane — it only rewrites tensor shape and stride on the CPU. No data copy, no kernel launch. The weight matrix stays in its original layout; PyTorch just tricks the GEMM into reading it as transposed.

Check the profiler table and you'll see zero CUDA time for the transpose. That's free.
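The metadata-only nature of the transpose can be checked without a profiler. The following is a minimal sketch, assuming the out_dim 64 and in_dim 32 from the run above; the variable names are illustrative and not taken from 02_linear.py. It runs on CPU tensors, since the view semantics are the same on any device.

```python
import torch

# Weight laid out the way nn.Linear stores it: (out_dim, in_dim).
weight = torch.randn(64, 32)

# .t() is the Python-level call that shows up as aten::t in a trace.
weight_t = weight.t()

# Both tensors point at the same memory; only shape and stride differ.
same_storage = weight_t.data_ptr() == weight.data_ptr()

print(weight.shape, weight.stride())      # torch.Size([64, 32]) (32, 1)
print(weight_t.shape, weight_t.stride())  # torch.Size([32, 64]) (1, 32)
print(same_storage)                       # True
print(weight_t.is_contiguous())           # False
```

The swapped strides are all the GEMM needs in order to read the original buffer as if it were transposed.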

Why There's No Separate Add Kernel

nn.Linear calls torch.nn.functional.linear, which dispatches to aten::linear. That op inspects the bias argument and routes straight to aten::addmm(bias, x, weight.t()). The addmm kernel, provided by cuBLAS, computes out = x @ weight.T + bias in one shot. The bias addition is an epilogue: a small computation the GEMM kernel runs just before writing results back to HBM. No second memory round-trip, no separate kernel.
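The dispatch described above implies that calling torch.addmm by hand should give the same numbers as the layer. A small sketch of that check, assuming the same batch 1024, in_dim 32, out_dim 64 shapes; it runs on CPU and only compares results, so it says nothing about which GPU kernel gets picked.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1024, 32)          # (batch, in_dim)
linear = torch.nn.Linear(32, 64)   # weight: (64, 32), bias: (64,)

with torch.no_grad():
    out_linear = linear(x)
    # torch.addmm(input, mat1, mat2) computes input + mat1 @ mat2,
    # so the bias goes first and the weight is passed as a transposed view.
    out_addmm = torch.addmm(linear.bias, x, linear.weight.t())

max_diff = (out_linear - out_addmm).abs().max().item()
print(out_linear.shape)
print(max_diff)
```

The bias of shape (64,) is broadcast across the 1024 rows, which is the per-column add a GEMM epilogue can apply before results are written out.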

This is the same addmm kernel you'd get if you wrote torch.add(torch.matmul(x, w), b) and compiled it with torch.compile in Part 1 of this series. Eager mode already uses the fused variant.

Compile Can't Improve What's Already Fused

Profiling the compiled version of the same single linear layer reveals the exact same cuBLAS GEMM kernel on the GPU, the same aten::addmm on the CPU, plus a few extra compile-internal CPU rows. Zero kernel fusion gain. torch.compile needs at least two separable operations to stitch together — a standalone linear with bias is already a single operation from the GPU's perspective.
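A comparison like this can be reproduced with torch.profiler. The sketch below is not the article's 02_linear.py; profiled_op_names is a hypothetical helper introduced here for illustration. It falls back to CPU-only profiling when no GPU is present, in which case only the CPU-side op rows appear.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profiled_op_names(fn, x, activities):
    """Hypothetical helper: run fn(x) under the profiler, return (op names, table)."""
    with torch.no_grad():
        fn(x)  # warm-up so one-time setup (and compilation, if any) stays out of the trace
        with profile(activities=activities) as prof:
            fn(x)
            if x.is_cuda:
                # Wait for the kernel so it finishes inside the profiled region.
                torch.cuda.synchronize()
    averages = prof.key_averages()
    return {evt.key for evt in averages}, averages.table(row_limit=15)


device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

model = torch.nn.Linear(32, 64).to(device)
x = torch.randn(1024, 32, device=device)

eager_ops, eager_table = profiled_op_names(model, x, activities)
print(eager_table)
print("aten::addmm" in eager_ops)
```

Passing torch.compile(model) to the same helper instead of model gives the compiled trace to set beside the eager one; that step needs a working compiler toolchain, so it is left out of the sketch.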

The reflex to slap torch.compile on everything is misguided here. For an MLP block with three linear layers and activations, the story changes: that's where compile has a sequence of bias epilogues and activations it can try to fuse. The blog's Part 3 will walk through that case.


Source: Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP
Domain: huggingface.co
