LLM serving on NVIDIA B300 and RTX Pro 6000 loses 13-27% of its throughput and more than doubles its KV-cache restore latency under GPU Confidential Computing, but the culprit isn't the GPU compute itself.
The Bridge, Not the Chip
BF16 matmul on B300 runs at 0.998x its non-confidential speed. Compute is fine. The bottleneck is the confidential VM-GPU bridge: a serialized, high-setup-cost channel that makes every host/device transfer expensive. Secure copies lose CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and every small crossing pays a fixed toll. In vLLM dense decode, small alloc-and-copy operations are 44x slower than non-confidential baselines. This breaks the assumption modern inference runtimes make about DMA: that it is cheap, concurrent, and asynchronous.
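To see why a fixed toll per crossing hurts small transfers most, here is a minimal cost-model sketch. It is not taken from the paper: the `Bridge` class, the setup costs, and the bandwidths are all hypothetical values chosen only to show the shape of the effect, under the assumption that each copy costs a fixed setup time plus a bandwidth term and that copies cannot overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    """Toy model of a host/device channel (hypothetical, not the paper's model)."""
    setup_us: float      # fixed cost per crossing, in microseconds (assumed)
    gbytes_per_s: float  # sustained bandwidth, in GB/s (assumed)

    def copy_us(self, nbytes: int) -> float:
        # 1 GB/s moves 1e3 bytes per microsecond.
        return self.setup_us + nbytes / (self.gbytes_per_s * 1e3)


def serialized_total_us(bridge: Bridge, sizes: list[int]) -> float:
    """Copies cannot overlap, so the total is the sum of every crossing."""
    return sum(bridge.copy_us(n) for n in sizes)


def coalesced_total_us(bridge: Bridge, sizes: list[int]) -> float:
    """Pack all payloads into one crossing and pay the setup cost once."""
    return bridge.copy_us(sum(sizes))


# Illustrative numbers only; they are not measurements from the paper.
plain = Bridge(setup_us=5.0, gbytes_per_s=50.0)
confidential = Bridge(setup_us=200.0, gbytes_per_s=40.0)

small_copies = [4096] * 64        # 64 copies of 4 KiB each
one_large_copy = [256 * 1024**2]  # a single 256 MiB copy

if __name__ == "__main__":
    for label, sizes in (("small", small_copies), ("large", one_large_copy)):
        slowdown = (serialized_total_us(confidential, sizes)
                    / serialized_total_us(plain, sizes))
        print(f"{label} copies: {slowdown:.1f}x slower on the confidential bridge")
    saved = (serialized_total_us(confidential, small_copies)
             - coalesced_total_us(confidential, small_copies))
    print(f"coalescing the small copies saves {saved:.0f} us")
```

In this toy model the small copies are dominated by the setup term, so their slowdown tracks the ratio of setup costs, while the large copy is dominated by bandwidth and slows down far less. That is the same qualitative pattern the text describes: compute and bulk transfers hold up, while many tiny crossings do not.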
Recovering 92% of the Gap
The paper identifies two targeted mitigations. A scheduling flag recovers 57% of the throughput loss. A worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model explains a +131% penalty on KV-restore latency and a 34x slowdown on model loading. Blackwell also changes the confidential tenancy unit: the researchers qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants.
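The summary does not spell out how the worker-thread drain works, so the sketch below is only one plausible reading: the serving loop hands copies that would block to a dedicated thread through a queue, and that thread drains them in order. Everything here is hypothetical, including the `DrainWorker` class, the `blocking_copy` stand-in (simulated with `time.sleep`), and the 10 ms delay; none of it is the paper's or vLLM's actual code.

```python
import queue
import threading
import time


def blocking_copy(nbytes: int) -> int:
    """Stand-in for a transfer that blocks its caller (simulated, 10 ms)."""
    time.sleep(0.01)
    return nbytes


class DrainWorker:
    """Runs blocking copies on one background thread, in submission order."""

    def __init__(self) -> None:
        self._requests: queue.Queue = queue.Queue()
        self.completed: list[int] = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            nbytes = self._requests.get()
            if nbytes is None:  # sentinel: no more work
                break
            self.completed.append(blocking_copy(nbytes))

    def submit(self, nbytes: int) -> None:
        # Returns immediately; the copy happens on the worker thread.
        self._requests.put(nbytes)

    def close(self) -> None:
        # Queue the sentinel behind all pending copies, then wait for them.
        self._requests.put(None)
        self._thread.join()


def run_with_drain(copy_sizes: list[int]) -> tuple[float, float, list[int]]:
    """Return (time spent submitting, total time, completed copies)."""
    worker = DrainWorker()
    start = time.perf_counter()
    for nbytes in copy_sizes:
        worker.submit(nbytes)
    submit_elapsed = time.perf_counter() - start
    worker.close()
    total_elapsed = time.perf_counter() - start
    return submit_elapsed, total_elapsed, worker.completed


if __name__ == "__main__":
    submitted, total, done = run_with_drain([4096] * 8)
    print(f"caller blocked for {submitted * 1e3:.2f} ms of {total * 1e3:.2f} ms")
    print(f"{len(done)} copies completed in order")
```

The copies still run one after another, so the bridge itself is no faster in this sketch. What changes is that the submitting thread is free to keep scheduling work while the worker absorbs the blocking calls.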
The remaining fabric-attestation gap for production confidential AI platforms is the next target - and now we know exactly which bridge to rebuild.
Source: The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
Domain: arxiv.org