Nemotron Diffusion Models Cut Token Latency 6x While Outperforming Qwen3

Q: What is the significance of: Nemotron diffusion models cut token latency 6x while outperforming Qwen3?

NVIDIA's new diffusion-based LLMs generate tokens six times faster than autoregressive baselines, while improving accuracy and allowing mid-generation revisions.

Nemotron‑Labs Diffusion 8B produces tokens six times faster than Qwen3 8B while scoring 1.2% higher on accuracy.

Parallel Drafting Beats Autoregressive Latency

Autoregressive (AR) LLMs still dominate because they are simple to train and serve, but each token forces a full model pass and a memory‑bound weight load. On modern GPUs, that serial bottleneck leaves most compute idle. Diffusion language models (DLMs) sidestep the issue by drafting blocks of tokens in parallel and refining them over multiple steps. In practice, Nemotron‑Labs Diffusion’s diffusion mode reaches 2.6× the tokens‑per‑forward‑pass (TPF) of an AR baseline, while its self‑speculation mode pushes that to 6× (linear) and 6.4× (quadratic) with comparable accuracy.
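To make the TPF figures concrete, here is a minimal arithmetic sketch of how many forward passes each mode would need for a fixed output length. It assumes the AR baseline emits exactly one token per forward pass, so the quoted multipliers can be read directly as tokens per pass. The function and dictionary names are hypothetical and not part of any NVIDIA tooling.

```python
import math
from typing import Dict

# Multipliers quoted in the text, relative to an AR baseline.
# Assumption: the AR baseline emits exactly one token per forward pass.
REPORTED_TPF: Dict[str, float] = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "self_speculation_linear": 6.0,
    "self_speculation_quadratic": 6.4,
}


def forward_passes_needed(num_tokens: int, tokens_per_forward: float) -> int:
    """Forward passes required to emit num_tokens at an average TPF."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_forward <= 0:
        raise ValueError("tokens_per_forward must be positive")
    # A partial pass still costs a full pass, hence the ceiling.
    return math.ceil(num_tokens / tokens_per_forward)


def passes_by_mode(num_tokens: int) -> Dict[str, int]:
    """Forward-pass count per mode for a given output length."""
    return {
        mode: forward_passes_needed(num_tokens, tpf)
        for mode, tpf in REPORTED_TPF.items()
    }


if __name__ == "__main__":
    for mode, passes in passes_by_mode(600).items():
        print(f"{mode}: {passes} forward passes for 600 tokens")
```

Fewer forward passes do not translate one-to-one into wall-clock speedup, since a pass that drafts or verifies a whole block can cost more than a single-token AR pass.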

Three Modes in One Model

The architecture unifies AR, diffusion, and self‑speculation into a single checkpoint. At deployment time, a simple flag selects the desired inference mode:

Autoregressive – left‑to‑right decoding, full backward compatibility.
Diffusion – block‑by‑block generation, iterative refinement.
Self‑speculation – diffusion drafts candidates, AR verifies them.

This design lets developers keep existing pipelines while unlocking speed‑critical workloads, even at a batch size of 1.
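The draft-then-verify idea behind the self‑speculation mode can be sketched with toy stand-ins for the two roles. This is a generic greedy speculative-decoding loop, not NVIDIA's implementation: `draft_block`, `ar_next`, and the toy models are hypothetical. A real system would score all drafted positions in one verifier forward pass instead of the per-token loop used here for clarity.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
DraftFn = Callable[[Sequence[Token], int], List[Token]]  # drafts a block
NextTokenFn = Callable[[Sequence[Token]], Token]         # AR greedy choice


def verify_block(prefix: Sequence[Token], draft: Sequence[Token],
                 ar_next: NextTokenFn) -> List[Token]:
    """Accept the longest draft prefix the AR verifier agrees with.

    On the first mismatch, the verifier's own token replaces the draft
    token, so every round yields at least one new token.
    """
    context = list(prefix)
    accepted: List[Token] = []
    for token in draft:
        expected = ar_next(context)
        if token != expected:
            accepted.append(expected)  # correction from the verifier
            return accepted
        accepted.append(token)
        context.append(token)
    if not accepted:  # empty draft: fall back to one plain AR step
        accepted.append(ar_next(context))
    return accepted


def self_speculative_decode(prompt: Sequence[Token], draft_block: DraftFn,
                            ar_next: NextTokenFn, block_size: int,
                            max_new_tokens: int) -> Tuple[List[Token], int]:
    """Return (new_tokens, rounds) using draft-then-verify decoding."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, ar_next))
        rounds += 1
    return out[len(prompt):len(prompt) + max_new_tokens], rounds


# Toy models (hypothetical): the "AR model" counts upward modulo 10, and
# the "drafter" does the same but always gets the last token of a block wrong.
def toy_ar_next(context: Sequence[Token]) -> Token:
    return (context[-1] + 1) % 10


def toy_draft_block(context: Sequence[Token], block_size: int) -> List[Token]:
    block = [(context[-1] + i) % 10 for i in range(1, block_size + 1)]
    block[-1] = (block[-1] + 5) % 10  # deliberate drafting error
    return block


if __name__ == "__main__":
    tokens, rounds = self_speculative_decode([0], toy_draft_block,
                                             toy_ar_next, 4, 8)
    print(tokens, rounds)
```

Because the verifier has the final say at every position, the output matches what plain greedy AR decoding would produce. The drafter only affects how many rounds are needed.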

Training Pathway and Practical Gains

Nemotron‑Labs Diffusion builds on Efficient‑DLM’s insight that a pretrained AR model can be converted to a diffusion model by switching to block‑wise attention and continuing pretraining. The 8B model was first trained on 1.3 T tokens from the NVIDIA Nemotron Pretraining datasets, then fine‑tuned on 45 B tokens from the Post‑training set. The joint AR‑diffusion objective preserves the AR knowledge while adding parallel drafting.
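As an illustration of what switching to block‑wise attention can mean, the sketch below builds a boolean attention mask. It assumes the common formulation in which tokens attend bidirectionally within their own block and causally to all earlier blocks. The source does not spell out the exact masking scheme, so treat this as one plausible reading, with a hypothetical function name.

```python
from typing import List


def block_wise_attention_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Build a mask where mask[q][k] is True if query q may attend to key k.

    Assumed rule: a query sees every key in its own block (bidirectional)
    and every key in earlier blocks (causal across blocks).
    """
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(key // block_size) <= (query // block_size) for key in range(seq_len)]
        for query in range(seq_len)
    ]


if __name__ == "__main__":
    for row in block_wise_attention_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` the mask reduces to the ordinary causal mask of an AR model, which suggests why continued pretraining from an AR checkpoint is a natural starting point.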

Deployment support is expected to land in SGLang’s main branch soon, which should make the model straightforward to integrate into existing inference stacks.

Forward‑Looking

By marrying AR reliability with diffusion speed, Nemotron‑Labs Diffusion turns the token‑by‑token paradigm on its head. Developers can now run latency‑sensitive code‑generation or real‑time summarization on a single GPU, revising earlier tokens on the fly. The next step will be scaling the vision‑language 8B variant and exploring quadratic self‑speculation in production workloads.

Source: Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
Domain: huggingface.co