Blackwell GPU Confidential Computing Costs LLM Servers Up to 27% Throughput - Here's Why

A new study pinpoints the serialized VM-GPU bridge as the culprit behind 13-27% throughput loss and doubled KV-cache restore latency on NVIDIA B300 and RTX Pro 6000, with a worker-thread drain recovering up to 92% of...

Tags: nvidia, blackwell, gpu confidential computing, llm serving, vllm, intel tdx

Under GPU Confidential Computing, LLM serving on NVIDIA B300 and RTX Pro 6000 loses 13-27% of its throughput and sees KV-cache restore latency more than double - but the culprit isn't the GPU compute itself.

The Bridge, Not the Chip

BF16 matmul on B300 runs at 0.998x its non-confidential speed, so compute is essentially unaffected. The bottleneck is the confidential VM-GPU bridge: a serialized channel with a high setup cost that makes host/device data movement expensive. Secure copies lose CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and every small crossing pays a fixed toll. In vLLM dense decode, small alloc-and-copy operations are 44x slower than non-confidential baselines. This behavior breaks the assumption modern inference runtimes make that DMA is cheap, concurrent, and asynchronous.
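To see why a fixed per-crossing toll on a serialized channel hurts small copies so much, here is a minimal cost-model sketch. It is not the paper's model: the function names, the 50 microsecond setup cost, and the 25 GB/s bandwidth are hypothetical values chosen only for illustration.

```python
# Toy cost model of a serialized bridge with a fixed setup toll per crossing.
# All constants are hypothetical; none are measurements from the study.

def crossing_time_us(num_bytes, setup_us, bandwidth_gb_s):
    """Time for one crossing: fixed setup toll plus size-dependent streaming time."""
    bytes_per_us = bandwidth_gb_s * 1e3  # 1 GB/s equals 1e3 bytes per microsecond
    return setup_us + num_bytes / bytes_per_us


def serialized_total_us(sizes, setup_us, bandwidth_gb_s):
    """Serialized channel: crossings cannot overlap, so their times simply add."""
    return sum(crossing_time_us(s, setup_us, bandwidth_gb_s) for s in sizes)


SETUP_US = 50.0        # hypothetical fixed toll per crossing
BANDWIDTH_GB_S = 25.0  # hypothetical streaming bandwidth

small_copies = [4096] * 256  # 256 separate 4 KiB copies (1 MiB in total)

# Cost of issuing every small copy as its own crossing.
many_small = serialized_total_us(small_copies, SETUP_US, BANDWIDTH_GB_S)

# Cost of moving the same bytes in one coalesced crossing.
one_batched = crossing_time_us(sum(small_copies), SETUP_US, BANDWIDTH_GB_S)

print(f"256 small crossings: {many_small:.1f} us")
print(f"1 batched crossing:  {one_batched:.1f} us")
```

In this toy model the setup toll dominates the small copies, so the total depends on how many times the bridge is crossed more than on how many bytes move across it.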

Recovering 92% of the Gap

The paper identifies two targeted mitigations. A scheduling flag recovers 57% of the throughput loss, and a worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model also explains a 131% increase in KV-restore latency (about 2.3x) and a 34x slowdown on model loading. Blackwell also changes the confidential tenancy unit: the researchers qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants.
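The recovery percentages are easier to read as residual loss. The sketch below does that arithmetic. Applying the 57% and 92% figures to the 13% and 27% endpoints of the loss range is an assumption made here for illustration; the article does not say which runs each percentage was measured on. The helper names are hypothetical.

```python
# Arithmetic on the reported figures. Pairing each recovery percentage with the
# endpoints of the 13-27% loss range is an illustrative assumption.

def residual_loss(loss_fraction, recovered_fraction):
    """Throughput loss that remains after a mitigation closes part of the gap."""
    return loss_fraction * (1.0 - recovered_fraction)


def slowdown_factor(percent_increase):
    """Convert a '+N%' latency penalty into a multiplicative factor."""
    return 1.0 + percent_increase / 100.0


for loss in (0.13, 0.27):
    flag = residual_loss(loss, 0.57)   # scheduling flag recovers 57% of the gap
    drain = residual_loss(loss, 0.92)  # worker-thread drain recovers up to 92%
    print(f"loss {loss:.0%}: flag leaves {flag:.1%}, drain leaves {drain:.1%}")

# A +131% KV-restore penalty corresponds to this latency multiplier.
kv_restore_factor = slowdown_factor(131)
print(f"KV-restore latency factor: {kv_restore_factor:.2f}x")
```

Under that assumption, a 27% loss would shrink to roughly 11.6% with the flag and roughly 2.2% with the drain, and the +131% KV-restore penalty is the same statement as "more than double": a factor of about 2.31.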

The remaining fabric-attestation gap for production confidential AI platforms is the next target - and now we know exactly which bridge to rebuild.


Source: The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
Domain: arxiv.org
