DiffusionGemma Generates Text in Parallel, Outpacing Autoregressive Models by 4x

Google DeepMind’s DiffusionGemma hits over 1,000 tokens per second on a single Nvidia H100—four times faster than the autoregressive Gemma models of similar size.

Why Parallel Generation Matters for Local Inference

Most large language models are autoregressive: they emit one token at a time, left to right, each step dependent on the previous one. DiffusionGemma breaks that mold. Borrowing the denoising technique from image generation models like Stable Diffusion, it starts with a field of placeholder tokens and iteratively refines the entire canvas in parallel. The result is a complete block of text produced in a fraction of the sequential wall-clock time.

For anyone running AI on local hardware, this is the difference between waiting for a response and getting one almost immediately. A Mixture of Experts architecture activates only 3.8 billion of the model's 26 billion total parameters per inference, and the model fits inside 18GB of GPU RAM. On an RTX 5090, that means 700 tokens per second. On an H100, it crosses 1,000.
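To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above, plus three assumptions that are not from the source: that "18GB" means 18 × 10^9 bytes, that all 26 billion weights are resident in GPU memory at once, and that the "four times faster" claim implies an autoregressive baseline of roughly 250 tokens per second on the H100. The 500-token response length is an arbitrary illustrative choice.

```python
# Figures quoted in the article.
TOTAL_PARAMS = 26e9        # total parameters
ACTIVE_PARAMS = 3.8e9      # parameters activated per inference
H100_TOKENS_PER_SEC = 1000
RTX5090_TOKENS_PER_SEC = 700

# Assumptions (not stated in the article).
GPU_MEMORY_BYTES = 18e9                           # "18GB" read as 18 * 10^9 bytes
BASELINE_TOKENS_PER_SEC = H100_TOKENS_PER_SEC / 4  # implied by the 4x claim
RESPONSE_TOKENS = 500                              # arbitrary example length


def active_fraction(active_params, total_params):
    """Share of the weights that take part in a single inference."""
    return active_params / total_params


def max_bits_per_param(memory_bytes, total_params):
    """Upper bound on average bits per weight if every weight sits in memory_bytes."""
    return memory_bytes * 8 / total_params


def seconds_for(tokens, tokens_per_second):
    """Wall-clock time to produce `tokens` at a steady throughput."""
    return tokens / tokens_per_second


print(f"active share of weights: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
print(f"bits per weight at most: {max_bits_per_param(GPU_MEMORY_BYTES, TOTAL_PARAMS):.2f}")
for name, rate in [
    ("H100", H100_TOKENS_PER_SEC),
    ("RTX 5090", RTX5090_TOKENS_PER_SEC),
    ("assumed autoregressive baseline", BASELINE_TOKENS_PER_SEC),
]:
    print(f"{name}: {seconds_for(RESPONSE_TOKENS, rate):.2f} s for {RESPONSE_TOKENS} tokens")
```

If the resident-weights assumption holds, the bits-per-weight bound suggests the 18GB figure refers to a quantized build, though the article does not state the precision.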

Architecture Details You Actually Care About

DiffusionGemma is part of the Gemma 4 open model family, but it works very differently from its siblings. Instead of predicting the next token, the model runs multiple passes over the latent text canvas, using each pass to improve its estimates for all tokens simultaneously. By the final step, the entire output has been “denoised” into coherent text.
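The multi-pass idea can be illustrated with a toy loop. This is a minimal sketch, not DiffusionGemma's actual algorithm: the article does not describe the noise schedule or how tokens are committed, so the code swaps in a simple confidence-based unmasking scheme. The stand-in "model" (`toy_predict`) cheats by already knowing the target sentence and assigning random confidences, and every name here (`MASK`, `toy_predict`, `denoise`) is hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def toy_predict(canvas, target, rng):
    """Stand-in for the network: propose (token, confidence) for every position at once.

    A real model would infer tokens from context; this toy simply looks them up
    in `target` and attaches a random confidence to still-masked positions.
    """
    proposals = []
    for i, tok in enumerate(canvas):
        if tok != MASK:
            proposals.append((tok, 1.0))  # already committed
        else:
            proposals.append((target[i], rng.random()))
    return proposals


def denoise(target, num_steps, seed=0):
    """Fill a fully masked canvas over `num_steps` passes, most confident positions first."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(1, num_steps + 1):
        proposals = toy_predict(canvas, target, rng)  # one pass over the whole canvas
        # Number of positions that should be committed by the end of this step (ceiling).
        want_filled = -(-len(target) * step // num_steps)
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        already_filled = len(target) - len(masked)
        for i in masked[: want_filled - already_filled]:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


target = "the quick brown fox jumps over the lazy dog".split()
final, history = denoise(target, num_steps=4)
for step, snapshot in enumerate(history):
    print(step, " ".join(snapshot))
```

Each printed row is one pass. Positions are filled in no particular left-to-right order, which is the key contrast with next-token prediction.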

This approach flips the fundamental latency trade-off. Autoregressive models scale inference time linearly with output length; DiffusionGemma’s cost is roughly constant for a given canvas size. Longer outputs become dramatically cheaper, especially for batch or streaming use cases.
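A simple way to see the trade-off is to count sequential forward passes. The sketch below is an assumption-laden cost model, not a benchmark: `canvas_size=256` and `steps_per_canvas=32` are made-up illustrative values, since the article gives neither, and the model ignores the fact that a single diffusion pass over a whole canvas may cost more than a single autoregressive step.

```python
import math


def autoregressive_passes(output_tokens):
    """One sequential forward pass per generated token."""
    return output_tokens


def diffusion_passes(output_tokens, canvas_size, steps_per_canvas):
    """A fixed number of passes per canvas, regardless of how full the canvas is."""
    canvases = math.ceil(output_tokens / canvas_size)
    return canvases * steps_per_canvas


# Hypothetical settings chosen only for illustration.
CANVAS_SIZE = 256
STEPS_PER_CANVAS = 32

for n in (64, 256, 1024):
    ar = autoregressive_passes(n)
    df = diffusion_passes(n, CANVAS_SIZE, STEPS_PER_CANVAS)
    print(f"{n:5d} tokens: autoregressive={ar:5d} passes, diffusion={df:4d} passes")
```

Under these assumptions the diffusion pass count stays flat until the output spills into a second canvas, while the autoregressive count grows with every token.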

What This Enables Next

Google DeepMind hasn’t published benchmark scores for quality, but the raw throughput numbers alone make DiffusionGemma a candidate for real-time applications—chat, code completion, and writing assistants running on consumer GPUs. The model is openly available, so expect the community to start stress-testing it against autoregressive baselines on perplexity and downstream tasks this week. If parallel text generation holds up on quality, the next generation of local AI hardware might stop caring about token-by-token latency entirely.

Source: Google's latest DiffusionGemma open AI model comes with a 4x speed boost
Domain: arstechnica.com
