Blackwell GPU Confidential Computing Costs LLM Servers Up to 27% of Their Throughput - Here's Why

arxiv.org · @wild_condor · 2 days ago · Artificial Intelligence · 1 comment

A new study identifies the serialized VM-GPU bridge as the culprit behind a 13-27% throughput loss and a doubling of KV-cache restore latency on NVIDIA B300 and RTX Pro 6000, and describes a worker-thread drain that recovers up to 92% of the gap.

nvidia · blackwell · gpu confidential computing · llm serving · vllm · intel tdx

13-27% throughput loss and more than double KV-cache restore latency hit LLM serving on NVIDIA B300 and RTX Pro 6000 under GPU Confidential Computing - but the culprit isn't the GPU compute itself.

The Bridge, Not the Chip

BF16 matmul on B300 runs at 0.998x its non-confidential speed. Compute is fine. The bottleneck is the confidential VM-GPU bridge: a serialized channel with a high per-transfer setup cost that makes host/device data movement expensive. Secure copies lose CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and every small crossing pays a fixed toll. In vLLM dense decode, small alloc-and-copy operations are 44x slower than non-confidential baselines. This behavior breaks the assumptions modern inference runtimes make about DMA being cheap, concurrent, and asynchronous.
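To see why a fixed per-crossing toll plus serialization hurts small copies far more than large ones, here is a toy cost model. It is only a sketch: the `Bridge` class, the setup costs, and the bandwidth figure are hypothetical values chosen for illustration, not measurements from the paper, and the overlap rule for the non-confidential case is an idealization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    """Toy model of a host/device channel. All numbers are hypothetical."""
    setup_s: float      # fixed cost paid on every crossing, in seconds
    bytes_per_s: float  # sustained bandwidth once a copy is running
    serialized: bool    # True: crossings cannot overlap across streams

    def copy_time(self, nbytes: int) -> float:
        return self.setup_s + nbytes / self.bytes_per_s

    def total_time(self, sizes, streams: int = 1) -> float:
        times = [self.copy_time(n) for n in sizes]
        if self.serialized or streams <= 1:
            # Every copy waits for the previous one to finish.
            return sum(times)
        # Idealized overlap: work spreads evenly over the streams,
        # but can never finish faster than the longest single copy.
        return max(max(times), sum(times) / streams)


# Illustrative parameters only; they are NOT taken from the paper.
PLAIN = Bridge(setup_s=5e-6, bytes_per_s=50e9, serialized=False)
CONFIDENTIAL = Bridge(setup_s=200e-6, bytes_per_s=50e9, serialized=True)


def slowdown(sizes, streams: int = 4) -> float:
    """Ratio of confidential to non-confidential transfer time."""
    return CONFIDENTIAL.total_time(sizes, streams) / PLAIN.total_time(sizes, streams)


small_copies = [4096] * 256        # many tiny crossings
one_big_copy = [1 << 30]           # a single 1 GiB crossing
coalesced = [sum(small_copies)]    # the same bytes sent as one crossing

if __name__ == "__main__":
    print(f"small-copy slowdown: {slowdown(small_copies):.1f}x")
    print(f"large-copy slowdown: {slowdown(one_big_copy):.3f}x")
    print(f"confidential, uncoalesced: {CONFIDENTIAL.total_time(small_copies):.6f} s")
    print(f"confidential, coalesced:   {CONFIDENTIAL.total_time(coalesced):.6f} s")
```

In this model the large copy is barely affected because bandwidth dominates, while the stream of small copies pays the setup cost on every crossing and loses all overlap. Coalescing the small copies into one crossing removes most of that penalty in the model. Whether a real runtime can coalesce this way depends on its allocation pattern.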

Recovering 92% of the Gap

The paper identifies two targeted mitigations. A scheduling flag recovers 57% of the throughput loss. A worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model explains a +131% penalty on KV-restore latency and a 34x slowdown on model loading. Blackwell also changes the confidential tenancy unit: the researchers qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants.
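The summary does not say how the worker-thread drain is implemented. One plausible reading is that blocking bridge crossings are handed to a dedicated thread, so the serving loop keeps scheduling while copies complete in order. The sketch below illustrates that general pattern only. `CopyDrain`, `fake_blocking_copy`, and `run_demo` are hypothetical names, and the "copy" is a `time.sleep` stand-in rather than a real CUDA call.

```python
import queue
import threading
import time


class CopyDrain:
    """Runs blocking copy callables on one worker thread, in submission order.

    Hypothetical illustration; not the paper's or vLLM's implementation.
    """

    def __init__(self):
        self._q = queue.Queue()
        self.completed = []   # tags of copies that finished
        self.errors = []      # (tag, exception) for copies that failed
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._q.get()
            try:
                if item is None:      # sentinel: shut the worker down
                    return
                tag, copy_fn = item
                try:
                    copy_fn()         # the blocking crossing happens here
                    self.completed.append(tag)
                except Exception as exc:
                    self.errors.append((tag, exc))
            finally:
                self._q.task_done()

    def submit(self, tag, copy_fn):
        """Enqueue a copy and return immediately; the caller is not blocked."""
        self._q.put((tag, copy_fn))

    def drain(self):
        """Block until every submitted copy has been processed."""
        self._q.join()

    def close(self):
        self._q.put(None)
        self._worker.join()


def fake_blocking_copy(seconds: float = 0.001):
    # Stand-in for a transfer that blocks at the runtime boundary.
    time.sleep(seconds)


def run_demo(n: int = 8):
    drain = CopyDrain()
    for i in range(n):
        drain.submit(i, fake_blocking_copy)
    # A serving loop could keep scheduling decode steps here.
    drain.drain()
    drain.close()
    return drain.completed
```

A single worker keeps the crossings in submission order, which matches a channel that is serialized anyway. The point is to move the wait off the scheduling thread, not to add parallelism on the bridge.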

The remaining fabric-attestation gap for production confidential AI platforms is the next target - and now we know exactly which bridge to rebuild.

Source: The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
Domain: arxiv.org
