Blackwell GPU Confidential Computing Costs LLM Servers 27% Throughput - Here's Why

arxiv.org · @wild_condor · yesterday · Artificial Intelligence · 1 comment

A new study pinpoints the serialized VM-GPU bridge as the culprit behind 13-27% throughput loss and doubled KV-cache restore latency on NVIDIA B300 and RTX Pro 6000, with a worker-thread drain recovering up to 92% of...

Tags: nvidia, blackwell, gpu confidential computing, llm serving, vllm, intel tdx

LLM serving on NVIDIA B300 and RTX Pro 6000 loses 13-27% of its throughput under GPU Confidential Computing, and KV-cache restore latency more than doubles. The culprit isn't the GPU compute itself.

The Bridge, Not the Chip

BF16 matmul on B300 runs at 0.998x its non-confidential speed, so compute is fine. The bottleneck is the confidential VM-GPU bridge: a serialized channel that charges a high setup cost on every host/device transfer. Secure copies lose CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and every small crossing pays a fixed toll. In vLLM dense decode, small alloc-and-copy operations run 44x slower than their non-confidential baselines. This behavior breaks the assumption modern inference runtimes make that DMA is cheap, concurrent, and asynchronous.
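The "fixed toll" on each crossing can be made concrete with a toy cost model. The sketch below is only an illustration: the function name, the setup cost, and the bandwidth are hypothetical values chosen for the example and are not measurements from the paper. It assumes copies are fully serialized, so each one pays a fixed setup cost plus the time to move its payload.

```python
def transfer_time(num_copies, bytes_per_copy, setup_s, bandwidth_bytes_per_s):
    """Total time for serialized copies.

    Each copy pays a fixed setup cost plus payload_bytes / bandwidth.
    No overlap between copies is modeled (the serialized-bridge assumption).
    """
    per_copy = setup_s + bytes_per_copy / bandwidth_bytes_per_s
    return num_copies * per_copy


# Hypothetical parameters, for illustration only (not from the paper).
SETUP_S = 50e-6      # assumed fixed cost per crossing, in seconds
BANDWIDTH = 20e9     # assumed sustained bandwidth, in bytes per second

# The same total payload, moved as 1000 small copies or as 1 large copy.
many_small = transfer_time(1000, 4096, SETUP_S, BANDWIDTH)
one_large = transfer_time(1, 1000 * 4096, SETUP_S, BANDWIDTH)

print(f"1000 x 4 KiB copies: {many_small * 1e3:.3f} ms")
print(f"1 x 4000 KiB copy:   {one_large * 1e3:.3f} ms")
```

Under these assumed numbers the setup cost dominates the small-copy case, which is the pattern the article describes for small alloc-and-copy operations: the payload is identical, but the number of crossings decides the cost.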

Recovering 92% of the Gap

The paper identifies two targeted mitigations. A scheduling flag recovers 57% of the throughput loss, and a worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model also explains a +131% penalty on KV-restore latency and a 34x slowdown on model loading.

Blackwell also changes the confidential tenancy unit: the researchers qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants.
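The article does not say how the worker-thread drain is implemented, so the following is only a generic sketch of the pattern the name suggests: blocking transfers are handed to a dedicated worker thread so the submitting thread does not stall on each one. Everything here is hypothetical, including the `TransferDrain` class and the `blocking_transfer` stand-in, which simulates a blocking copy with `time.sleep` and touches no GPU.

```python
import queue
import threading
import time


def blocking_transfer(item, delay_s=0.01):
    """Hypothetical stand-in for a copy that blocks its calling thread."""
    time.sleep(delay_s)
    return item


class TransferDrain:
    """Drains queued transfers on one worker thread, in submission order."""

    def __init__(self):
        self._q = queue.Queue()
        self.completed = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, item):
        # Returns immediately; the worker pays the blocking cost.
        self._q.put(item)

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:  # sentinel: stop after draining earlier items
                break
            self.completed.append(blocking_transfer(item))

    def close(self):
        # Queue the sentinel behind all pending work, then wait for the worker.
        self._q.put(None)
        self._worker.join()


def run_demo(n=5):
    drain = TransferDrain()
    start = time.perf_counter()
    for i in range(n):
        drain.submit(i)
    submit_elapsed = time.perf_counter() - start  # time the caller was busy
    drain.close()
    total_elapsed = time.perf_counter() - start   # time until all copies finish
    return drain.completed, submit_elapsed, total_elapsed


if __name__ == "__main__":
    done, submit_s, total_s = run_demo()
    print(f"completed={done} submit={submit_s * 1e3:.2f} ms total={total_s * 1e3:.2f} ms")
```

The transfers still run one after another, which matches a serialized channel. What changes is that the submitting thread is free to keep scheduling work while the worker absorbs the blocking calls.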

The remaining fabric-attestation gap for production confidential AI platforms is the next target - and now we know exactly which bridge to rebuild.

Source: The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
Domain: arxiv.org

