Blackwell GPU Confidential Computing costs LLM serving up to 27% of its throughput: here's why

A new study identifies the serialized VM-GPU bridge as the culprit behind a 13-27% throughput loss and a doubling of KV-cache restore latency on NVIDIA B300 and RTX Pro 6000, with a worker-thread drain recovering up to 92% of the...

nvidia, blackwell, gpu confidential computing, llm serving, vllm, intel tdx

Under GPU Confidential Computing, LLM serving on NVIDIA B300 and RTX Pro 6000 loses 13-27% of its throughput and sees KV-cache restore latency more than double - but the culprit isn't the GPU compute itself.

The Bridge, Not the Chip

BF16 matmul on B300 runs at 0.998x its non-confidential speed. Compute is fine. The bottleneck is the confidential VM-GPU bridge: a serialized, high-setup-cost channel that makes every host/device transfer expensive. Secure copies lose CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and every small crossing pays a fixed toll. In vLLM dense decode, small alloc-and-copy operations are 44x slower than their non-confidential baselines. This behavior violates every assumption modern inference runtimes make about DMA being cheap, concurrent, and asynchronous.
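To see why a fixed per-crossing toll hurts small transfers so much, here is a toy cost model. It is only a sketch: the function name, the setup cost, and the bandwidth figure are hypothetical values chosen for illustration, not measurements from the paper. It assumes crossings are fully serialized and that each one pays the same setup cost before any bytes move.

```python
def transfer_time_us(num_crossings, bytes_per_crossing, setup_us, bandwidth_bytes_per_us):
    """Total time for serialized crossings that each pay a fixed setup cost.

    Toy model only: real bridges have more moving parts than one constant.
    """
    per_crossing_us = setup_us + bytes_per_crossing / bandwidth_bytes_per_us
    return num_crossings * per_crossing_us


# Hypothetical parameters, NOT taken from the paper.
SETUP_US = 50.0                  # assumed fixed toll per crossing
BANDWIDTH_BYTES_PER_US = 10_000.0  # assumed sustained bridge bandwidth

# The same 4,096,000 bytes moved two ways.
many_small_us = transfer_time_us(1000, 4096, SETUP_US, BANDWIDTH_BYTES_PER_US)
one_large_us = transfer_time_us(1, 1000 * 4096, SETUP_US, BANDWIDTH_BYTES_PER_US)

print(f"1000 small crossings: {many_small_us:.1f} us")
print(f"1 large crossing:     {one_large_us:.1f} us")
print(f"slowdown from splitting: {many_small_us / one_large_us:.1f}x")
```

With these made-up numbers the setup cost dominates the small-copy case almost entirely, which is the shape of the problem the article describes: the payload is the same, but the number of crossings decides the bill.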

Recovering 92% of the Gap

The paper identifies two targeted mitigations. A scheduling flag recovers 57% of the throughput loss. A worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model explains a +131% penalty on KV-restore latency and a 34x slowdown on model loading. Blackwell also changes the confidential tenancy unit: the researchers qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants.
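As a rough back-of-the-envelope check, the recovery percentages can be turned into a residual overhead. This sketch assumes the recovered fraction applies directly to the throughput gap and pairs it with the worst-case 27% loss; the article does not say which loss figure each recovery number was measured against, so treat the pairing as an illustration only. The helper names are made up for this example.

```python
def residual_loss(loss_fraction, recovered_fraction):
    """Throughput loss that remains after recovering part of the gap."""
    return loss_fraction * (1.0 - recovered_fraction)


def latency_multiplier(penalty_percent):
    """Convert a '+N%' latency penalty into a multiplier on the baseline."""
    return 1.0 + penalty_percent / 100.0


# Assumption: both mitigations are applied to the worst-case 27% loss.
worst_case_loss = 0.27
after_flag = residual_loss(worst_case_loss, 0.57)   # scheduling flag
after_drain = residual_loss(worst_case_loss, 0.92)  # worker-thread drain

print(f"residual loss with scheduling flag: {after_flag:.1%}")
print(f"residual loss with worker drain:    {after_drain:.1%}")
print(f"KV-restore latency multiplier:      {latency_multiplier(131):.2f}x")
```

Under that assumption, the worker-thread drain would leave a residual loss of about two percentage points, and the +131% KV-restore penalty corresponds to roughly 2.3x the baseline latency, consistent with the "more than double" figure quoted earlier.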

The remaining fabric-attestation gap for production confidential AI platforms is the next target - and now we know exactly which bridge to rebuild.


Source: The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
Domain: arxiv.org
