NVIDIA Blackwell Sweeps MLPerf with 1.6x Speedup and 8,192-GPU Scale

NVIDIA's Blackwell platform crushed MLPerf Training 6.0: fastest time on every benchmark, largest scale at 8,192 GPUs, and the only platform to enter all seven suites.

MoE Training Tames All-to-All Communication

MLPerf Training 6.0 added two mixture-of-experts pretraining workloads - DeepSeek-V3 671B and GPT-OSS-20B - reflecting the industry's shift toward MoE architectures. NVIDIA's GB200 NVL72 and GB300 NVL72 rack-scale systems handle the all-to-all communication challenge by using fifth-generation NVLink Switches to connect all 72 GPUs into a unified compute and memory pool. That bandwidth advantage turns a potential bottleneck into a speed asset.

GB300 NVL72 delivered up to 1.6x faster training than GB200 NVL72 at the same scale, driven by higher compute density with NVFP4, expanded memory capacity, and a higher power ceiling that keeps the GPU at peak performance. NVIDIA also showcased NVFP4 training methods that pretrained the 550-billion-parameter Nemotron 3 Ultra model while meeting strict accuracy requirements.

Scale and Reliability Built for Production

NVIDIA scaled its DeepSeek-V3 671B submission to 8,192 GPUs using GB200 NVL72 systems - the largest Blackwell-based cluster in MLPerf Training history. On Llama 3.1 405B, one of the largest dense LLMs in the suite, NVIDIA submitted results at 5,120 GPUs. Microsoft Azure reached the Llama 3.1 405B reference quality target in 7.07 minutes using 8,192 GPUs - the fastest time for that benchmark. CoreWeave hit 2.02 minutes on DeepSeek-V3 671B at 8,192-GPU scale with GB300 NVL72 and Spectrum-X Ethernet.

Production reliability gets equal attention. NVIDIA screens each GPU across 30+ manufacturing test stages. An on-chip Reliability, Availability and Serviceability Engine monitors nearly the entire die, with self-healing logic that routes around faults without interrupting workloads. At the network level, Spectrum-X Ethernet reroutes around failed links in milliseconds. NVIDIA's Resiliency Extension (NVRx) automatically detects underperforming nodes and recovers from interruptions via checkpoints rather than restarting entire jobs.

Partners Ship Faster Training Now

Nineteen organizations submitted results on NVIDIA infrastructure. Cohere got 3x faster training on GB200 NVL72 for its North agentic AI platform. Midjourney trained its v8 image generation model on a Blackwell cluster and is now scaling a large fleet of Blackwell Ultra GPUs on CoreWeave for upcoming image and video models. Thinking Machines Lab saw 2x faster training and serving on Google Cloud's GB300 NVL72 compared with prior-generation GPUs. Higgsfield reduced model training time by 30% on Nebius Blackwell infrastructure, supporting 22 million users and over 6 million pieces of AI content per day.

Frontier model builders now have a clear reference: NVIDIA Blackwell delivers the fastest training times at the largest scale with production-grade reliability - and the gap is widening.

Source: Fastest, Largest, Strongest: NVIDIA Blackwell Sweeps MLPerf Training 6.0
Domain: blogs.nvidia.com

NVIDIA Blackwell Sweeps MLPerf with 1.6x Speedup and 8,192-GPU Scale

MoE Training Tames All-to-All Communication

Scale and Reliability Built for Production

Partners Ship Faster Training Now

More in Artificial Intelligence