
NVIDIA Blackwell Inference Stack Cuts Token Cost 5x in One Month

Combining disaggregated serving, NVFP4, and multi-token prediction compounds throughput gains to as much as 20x per GPU, and DeepSeek V4 token costs dropped 5x in 30 days.

nvidia, blackwell, tensorrt llm, dynamo, inference, large language models

NVIDIA's Blackwell inference software stack has already cut token costs on DeepSeek V4 by 5x in one month, and that is only the starting point. SemiAnalysis InferenceX numbers for the GB300 NVL72 show that stacking disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction lifts throughput per GPU by 20x versus an unoptimized baseline. The gains are not just theoretical: Baseten shipped TensorRT-LLM optimizations on Blackwell and delivered up to 50% more tokens per second for reasoning and long-context workloads.

5x Cost Cut in a Month—And That’s Just the Start

Agentic AI doesn't behave like traditional web workloads. A single user request fans out into hundreds of subagents, thousands of tasks, multiple LLMs, tool calls, and state management spread across GPUs, CPUs, DPUs, and storage. The software stack decides whether that complexity wastes capacity or slashes cost per token. NVIDIA's full-stack approach coordinates three layers: production operation, where Dynamo handles distributed serving, orchestration, and autoscaling; application acceleration, where TensorRT-LLM overlaps compute with communication and fuses kernels; and infrastructure access, which exposes GPU, networking, and memory capabilities without forcing developers to work at the level of device instructions.

Why Open Source Makes the Compounding Possible

PyTorch launched in 2016 with native CUDA support, and that co-evolution means innovations like DFlash speculative decode (up to 15x more throughput on existing hardware) or FastVideo (1080p video in under 5 seconds) run on Blackwell right away, because the frameworks have been built on CUDA from day one. When DeepSeek V4 was released, vLLM and SGLang had day-zero deployment recipes for Blackwell, and both frameworks saw performance improve 5x in roughly a month as the community fed production learnings back into the code. Cognition uses Dynamo to manage inference GPUs for reinforcement learning without building infrastructure from scratch; Deep Infra serves frontier open models on Blackwell from the hour they are released.

Individual optimizations compound only when the stack is designed as one system. Disaggregated serving, NVFP4, and large expert parallelism each deliver meaningful gains alone, but combined they reach that 20x throughput multiplier on GB200 NVL72 systems. The open-source flywheel means every new model release and kernel optimization reaches production faster, and every deployment feeds back into lower cost per token. Expect the gap between chip specs and delivered token economics to widen further as Dynamo and TensorRT-LLM absorb more production patterns from companies like Together AI and Cursor.
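
To make the compounding arithmetic concrete, here is a minimal Python sketch. It assumes the gains are independent and multiply, and that the price of a GPU-hour stays fixed, so cost per token falls in proportion to throughput. Only the 20x combined figure comes from the article. The per-optimization factors, the GPU-hour price, and the baseline throughput are hypothetical placeholders chosen so the product equals 20, not measured results.

```python
from math import prod


def compound_speedup(factors):
    """Multiply independent per-optimization speedups into one overall factor."""
    return prod(factors)


def cost_per_million_tokens(gpu_hour_usd, tokens_per_sec_per_gpu):
    """Dollars per one million tokens for one GPU at a fixed hourly price."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical placeholders: none of these numbers are from the article.
GPU_HOUR_USD = 3.0               # assumed price of one GPU-hour
BASELINE_TOKENS_PER_SEC = 500.0  # assumed unoptimized throughput per GPU

# Illustrative per-optimization factors, picked only so the product is 20x.
ILLUSTRATIVE_FACTORS = {
    "disaggregated serving": 2.0,
    "large expert parallelism": 2.5,
    "NVFP4 precision": 2.0,
    "multi-token prediction": 2.0,
}

total_speedup = compound_speedup(ILLUSTRATIVE_FACTORS.values())
baseline_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC)
optimized_cost = cost_per_million_tokens(
    GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC * total_speedup
)

print(f"combined speedup: {total_speedup:.1f}x")
print(f"baseline cost per 1M tokens:  ${baseline_cost:.4f}")
print(f"optimized cost per 1M tokens: ${optimized_cost:.4f}")
```

Under these assumptions a 20x throughput gain puts cost per token at one twentieth of the baseline, and by the same relation a 5x cost cut corresponds to a 5x throughput gain at a fixed GPU price. Real deployments rarely compose this cleanly, since the optimizations interact with each other and with latency targets.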


Source: How NVIDIA's Inference Software Stack Powers the Lowest Token Cost
Domain: blogs.nvidia.com
