Source linked

Blackwell GPU Computing confidentiel coûte des serveurs LLM 27% de débit - voici pourquoi

arxiv.org@wild_condor2 days ago·Artificial Intelligence·1 comments

Une nouvelle étude identifie le pont VM-GPU sérialisé comme le coupable de la perte de débit de 13 à 27 % et du doublement de la latence de récupération de la cache KV sur les NVIDIA B300 et RTX Pro 6000, avec un drainage de fil de travailleur récupérant jusqu'à 92 % de la...

nvidiablackwellgpu confidential computingllm servingvllmintel tdx

13-27% throughput loss and more than double KV-cache restore latency hit LLM serving on NVIDIA B300 and RTX Pro 6000 under GPU Confidential Computing - but the culprit isn't the GPU compute itself.

The Bridge, Not the Chip

BF16 matmul on B300 runs at 0.998x its non-confidential speed. Compute is fine. The bottleneck is the confidential VM-GPU bridge: a serialized, high-setup-cost channel that turns host/device movement into a nightmare. Secure copies lose CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and every small crossing pays a fixed toll. In vLLM dense decode, small alloc-and-copy operations are 44x slower than non-confidential baselines. That violates every assumption modern inference runtimes make about DMA being cheap, concurrent, and asynchronous.

Recovering 92% of the Gap

The paper identifies two targeted mitigations. A scheduling flag recovers 57% of the throughput loss. A worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model explains a +131% penalty on KV-restore latency and a 34x slowdown on model loading. Blackwell also changes the confidential tenancy unit: the researchers qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants.

The remaining fabric-attestation gap for production confidential AI platforms is the next target - and now we know exactly which bridge to rebuild.

Source: The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

More in Artificial Intelligence

view topic

Cara's Domain-Specific AI Saves Insurance Brokers 10 Hours a Week on AWS

Cara's AI-native platform on AWS saves insurance brokers 10 hours per week per user, onboarding enterprises in hours.

Five reasons frontier AI pricing is about to collapse

GPT 5.5 costs $5/$30 per million tokens, but open-weight GLM-5.2 beats it at 1/10th the price. Here's why the gap won't last.

Persona Steering Suppresses Refusal in Llama and Qwen Models

In Llama-3.1-8B-Instruct, steering a compliant persona drops refusal rate from 97% to 2% - refusal is gated downstream at late-layer expression.

Cascading Linear Features Expose Sycophancy and Let You Steer LLMs

A new iterative data-generation pipeline isolates linearly separable features for sycophancy, enabling detection and steering that matches or beats LLM-as-a-judge with less compute and full interpretability.

Comments load interactively on the live page.