Up to 20x more agents per megawatt than Hopper. That's the headline from the first round of AgentPerf, an infrastructure benchmark built by Artificial Analysis specifically for agentic AI workloads: not chat, not single LLM calls, but the chained, tool-calling, context-growing relay race that actual agents run.
Why Agentic AI Needs Its Own Benchmark
Chat completions are sprints: one LLM call, one response. Agents are multi-stage relays: dozens to hundreds of chained LLM calls, each passing along a growing context, punctuated by tool calls such as compiling code, searching a database, or browsing the web. The cost compounds rather than simply adding up, because every call has to carry the context that earlier calls and tool outputs have accumulated. Existing inference benchmarks measure how fast a single LLM responds and how many simultaneous requests a system can handle. They break down on agentic workloads, where tool-call latency and context growth stress memory bandwidth and interconnect in completely different ways.
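To make the compounding concrete, here is a minimal sketch of how input tokens add up over one agent trajectory. It assumes a fixed number of tokens is appended to the context at every step and that no prefix caching is used; the function names and all numbers are hypothetical illustrations, not AgentPerf measurements.

```python
def context_sizes(initial_tokens: int, tokens_added_per_step: int, steps: int) -> list[int]:
    """Context length seen by each chained LLM call in one trajectory.

    Assumption: every step appends the same number of tokens
    (model output plus tool result) to the running context.
    """
    return [initial_tokens + i * tokens_added_per_step for i in range(steps)]


def total_input_tokens(initial_tokens: int, tokens_added_per_step: int, steps: int) -> int:
    """Total input tokens processed across the trajectory.

    Assumption: no prefix cache is reused, so each call re-reads its full context.
    """
    return sum(context_sizes(initial_tokens, tokens_added_per_step, steps))


if __name__ == "__main__":
    # Hypothetical numbers: a single chat turn versus a 50-step agent run.
    chat = total_input_tokens(initial_tokens=2_000, tokens_added_per_step=0, steps=1)
    agent = total_input_tokens(initial_tokens=2_000, tokens_added_per_step=1_500, steps=50)
    print(f"chat turn:  {chat:,} input tokens")
    print(f"agent run:  {agent:,} input tokens")
```

Under these assumptions the total grows roughly with the square of the step count, which is why a system tuned for short, independent requests can behave very differently on long trajectories.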
Measured with DeepSeek V4 Pro on Real Coding Trajectories
AgentPerf replays real coding agent trajectories drawn from public repositories across more than 12 programming languages. Agents read files, edit code, execute commands, and iterate; tool calls are simulated with representative CPU delays so that the measurement isolates accelerator performance. The result is a single metric: how many concurrent agent tasks a platform can sustain at a defined output-token rate (20 and 60 tokens/sec per agent). On this workload with DeepSeek V4 Pro, a large MoE model representative of frontier agents, NVIDIA's GB300 NVL72 delivered the highest performance, running up to 20x more agents per megawatt than the HGX H200 system.
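The arithmetic behind an agents-per-megawatt figure can be sketched as follows. This is an illustration under stated assumptions, not AgentPerf's actual methodology: the helper names, the per-agent rate samples, and the power figure are all hypothetical.

```python
def sustained_agents(per_agent_tokens_per_sec: list[float], target_rate: float) -> int:
    """Count agents whose measured output rate meets the target.

    Assumption: an agent counts as "sustained" if its output rate is at
    least the target (for example 20 or 60 tokens/sec).
    """
    return sum(1 for rate in per_agent_tokens_per_sec if rate >= target_rate)


def agents_per_megawatt(agents: int, system_power_kw: float) -> float:
    """Normalize a concurrent-agent count by system power draw."""
    if system_power_kw <= 0:
        raise ValueError("system_power_kw must be positive")
    return agents * 1000.0 / system_power_kw


if __name__ == "__main__":
    # Hypothetical measurements for four agents at a 20 tokens/sec target.
    rates = [25.0, 19.5, 61.0, 20.0]
    print(sustained_agents(rates, target_rate=20.0))
    # Hypothetical system: 240 sustained agents on a 120 kW rack.
    print(agents_per_megawatt(240, system_power_kw=120.0))
```

Normalizing by power rather than by GPU count is what lets systems of different sizes, such as a 72-GPU rack and an 8-GPU server, be compared on the same axis.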
Full-Stack Codesign, Not Just Faster GPUs
That advantage of up to 20x comes from extreme codesign across the stack. GB300 NVL72 connects 72 GPUs into a single rack-scale system, letting MoE models like DeepSeek V4 Pro distribute expert execution efficiently. CUDA kernels overlap communication and compute, absorbing coordination cost instead of adding it to latency. TensorRT-LLM separates input processing from output generation so each can be independently optimized as concurrent sessions scale. The result isn't just raw throughput; it's per-watt agent density that translates directly into infrastructure ROI.
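Why overlapping communication with compute matters can be shown with an idealized timing model. The sketch below assumes perfect overlap, so a step costs the larger of the two phases instead of their sum; real kernels rarely overlap perfectly, and the function names and timings are hypothetical.

```python
def serialized_step_ms(compute_ms: float, comm_ms: float) -> float:
    """Step time when communication waits for compute to finish."""
    return compute_ms + comm_ms


def overlapped_step_ms(compute_ms: float, comm_ms: float) -> float:
    """Step time under idealized, perfect overlap of the two phases."""
    return max(compute_ms, comm_ms)


if __name__ == "__main__":
    # Hypothetical per-step timings for one MoE layer.
    compute_ms, comm_ms = 8.0, 3.0
    print(serialized_step_ms(compute_ms, comm_ms))
    print(overlapped_step_ms(compute_ms, comm_ms))
```

In this model the communication cost disappears from step latency whenever it is shorter than the compute it hides behind, which is the sense in which coordination cost is absorbed rather than added.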
Who's Already Using This
Inference providers like Baseten, DeepInfra, and Together AI are already running DeepSeek V4 Pro on Blackwell for production agentic applications. Together AI powers Cursor's agentic coding platform on Blackwell; DeepInfra runs Pam.ai (AI workforce for car dealerships) on the same hardware. These aren't benchmarks—they're deployments.
NVIDIA's Vera Rubin architecture is now in full production, meaning the next generation of agentic infrastructure is already shipping. Expect per-watt agent counts to climb further as the open-source ecosystem continues optimizing inference software for this new workload class.
Source: NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark
Domain: blogs.nvidia.com