
DiffusionGemma Cranks 1000+ Tokens/sec With 256-Token Parallel Blocks

Google's new open experimental model uses text diffusion to generate entire blocks simultaneously, achieving 1000+ tokens per second on an H100 while fitting in 18GB VRAM.

diffusiongemma, google, gemma 4, h100, mixture of experts, text diffusion

More than 1,000 tokens per second on a single NVIDIA H100: that’s what DiffusionGemma delivers by replacing token-by-token autoregressive generation with parallel 256-token blocks.

Google released DiffusionGemma as an experimental open model under Apache 2.0. It's a 26B Mixture of Experts model that activates only 3.8B parameters per inference pass. Quantized, it fits in under 18GB of VRAM, so it runs on high-end consumer GPUs like the RTX 5090, where it reaches 700+ tokens/sec.
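A rough back-of-envelope check on the memory figure, assuming weight storage dominates and ignoring activations, caches, and runtime overhead. The bit widths below are assumptions, since the article does not say which quantization produces the sub-18GB number. A Mixture of Experts model generally keeps every expert resident in memory, so the 26B total drives the footprint, not the 3.8B active count.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # from the article: 26B total parameters

# ASSUMED precisions; the article does not name a quantization scheme.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions only the 4-bit case lands below 18GB, leaving a few gigabytes for everything that is not weights.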

Shifting the Bottleneck from Memory to Compute

Most LLMs behave like a typewriter: one token at a time, left to right. On a dedicated local GPU that’s terrible utilization, because every new token means streaming the model weights from memory again while the compute units sit mostly idle. DiffusionGemma instead drafts an entire 256-token block in parallel. That shifts the decode bottleneck from memory bandwidth to compute, letting the GPU run flat out.
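One way to see the shift is to count forward passes, each of which sweeps the active weights through the GPU once. The sketch below is a simplified model, not a benchmark: the step count per block is a hypothetical value, since the article does not give one.

```python
import math


def passes_autoregressive(num_tokens: int) -> int:
    """One forward pass (one sweep over the active weights) per generated token."""
    return num_tokens


def passes_block_diffusion(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Each block costs a fixed number of refinement passes, however many tokens it holds."""
    return math.ceil(num_tokens / block_size) * steps_per_block


BLOCK_SIZE = 256      # from the article
STEPS_PER_BLOCK = 32  # HYPOTHETICAL: the article gives no refinement step count

ar = passes_autoregressive(256)
bd = passes_block_diffusion(256, BLOCK_SIZE, STEPS_PER_BLOCK)
print(f"autoregressive passes:  {ar}")
print(f"block-diffusion passes: {bd}")
print(f"ratio of weight sweeps: {ar / bd:.0f}x")
```

The ratio is not a speedup prediction. Each block pass does far more arithmetic than a single-token pass because it processes all 256 positions at once, which is why the limit moves from memory bandwidth toward compute.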

The model uses a novel diffusion head built on top of Gemma 4’s architecture and Gemini Diffusion research. Instead of predicting tokens sequentially, it starts with a canvas of random placeholder tokens and iteratively refines them, much as image diffusion models do for pixels. Bi-directional attention means every token in the block can attend to every other token, which opens up non-linear generation patterns.
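The refinement loop can be illustrated with a toy. This is not DiffusionGemma's algorithm: it swaps in a simple confidence-based unmasking scheme, uses a single mask symbol rather than random tokens, and replaces the model with a stand-in `toy_denoiser` that is handed the answer. All names here are hypothetical. What it does show is a canvas filling in out of order, with confidence rising as neighbours on both sides are resolved.

```python
MASK = "_"  # placeholder for positions not yet decided


def toy_denoiser(canvas, target):
    """Stand-in for a real model: propose a token and a confidence per masked slot.

    Confidence rises with word length and with filled neighbours on BOTH sides,
    a crude imitation of bidirectional context. A real model never sees `target`.
    """
    proposals = {}
    last = len(canvas) - 1
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        left = i > 0 and canvas[i - 1] != MASK
        right = i < last and canvas[i + 1] != MASK
        confidence = 0.1 * len(target[i]) + 0.3 * left + 0.3 * right
        proposals[i] = (target[i], confidence)
    return proposals


def refine_block(target, tokens_per_step):
    """Start fully masked, then commit the most confident proposals each step."""
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    while MASK in canvas:
        proposals = toy_denoiser(canvas, target)
        # Highest confidence first; ties break by position.
        best = sorted(proposals, key=lambda i: (-proposals[i][1], i))[:tokens_per_step]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


target = "the quick brown fox jumps over the lazy dog".split()
final, history = refine_block(target, tokens_per_step=3)
for step, snapshot in enumerate(history):
    print(step, " ".join(snapshot))
```

The printed snapshots show words appearing in the middle of the sentence before the first word is settled, which a left-to-right decoder cannot do.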

Where Diffusion Wins and Where It Doesn’t

The speed gain is real, but it’s not universal. DiffusionGemma’s 4x speedup targets low-concurrency local inference: a single user on a single accelerator. In high-QPS cloud serving, autoregressive models can batch thousands of requests and saturate compute just fine. There, parallel decoding offers diminishing returns and can even increase serving costs.
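A minimal sketch of why batching closes the gap: what matters for utilization is roughly how many token positions share each sweep over the weights. The model below is deliberately crude. It ignores attention cost, KV caches, and the fact that batched requests in a Mixture of Experts model may route to different experts, and the batch size of 256 is a hypothetical chosen for comparison.

```python
def tokens_per_weight_sweep(batch_size: int, positions_per_pass: int) -> int:
    """Token positions processed for each sweep of the weights through the GPU."""
    return batch_size * positions_per_pass


regimes = {
    "autoregressive, 1 user": tokens_per_weight_sweep(1, 1),
    # HYPOTHETICAL batch size, picked to match the block size for comparison.
    "autoregressive, 256 users batched": tokens_per_weight_sweep(256, 1),
    "256-token block, 1 user": tokens_per_weight_sweep(1, 256),
}

for name, tokens in regimes.items():
    print(f"{name:<36} {tokens:>4} positions per sweep")
```

In this simplified view, a busy server already gets from batching what a single local user gets from block decoding, so the parallel block adds little on top.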

Output quality also takes a hit. Google is upfront about this: standard Gemma 4 remains the pick for production-grade quality. DiffusionGemma trades some coherence for speed and parallel layout generation. You can recover quality through fine-tuning, which is exactly what Unsloth did. They fine-tuned it to play Sudoku, a task autoregressive models struggle with because each cell depends on cells that have not been generated yet. Bi-directional attention lets the model weigh those constraints together instead of committing left to right.
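The dependence on future tokens is easy to make concrete. The sketch below assumes the grid is serialized row by row into 81 positions, which is an illustrative choice and not necessarily how the fine-tune encoded it. For a given cell it counts how many of the cells that constrain it (same row, column, or 3x3 box) come later in that order.

```python
def peers(index: int) -> set[int]:
    """Indices of the cells sharing a row, column, or 3x3 box with `index`."""
    row, col = divmod(index, 9)
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    same_row = {row * 9 + c for c in range(9)}
    same_col = {r * 9 + col for r in range(9)}
    same_box = {
        r * 9 + c
        for r in range(box_row, box_row + 3)
        for c in range(box_col, box_col + 3)
    }
    return (same_row | same_col | same_box) - {index}


def later_peers(index: int) -> int:
    """How many constraining cells appear AFTER this one in row-major order."""
    return sum(1 for p in peers(index) if p > index)


for cell in (0, 40, 80):
    print(f"cell {cell:>2}: {len(peers(cell))} peers, {later_peers(cell)} still in the future")
```

A left-to-right decoder has to commit the first cell before seeing any of the cells that constrain it, while a block decoder can revise every cell with the whole grid in view.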

Practical Trade-offs and the Sudoku Surprise

Sudoku isn’t a gimmick. It demonstrates that DiffusionGemma can handle tasks requiring global coherence and non-sequential reasoning, such as in-line code editing, amino acid sequences, and mathematical graphs. For developers building real-time interactive local tools like live code infill or collaborative editing, that’s the real win.

The weights are available on Hugging Face now and can be served with vLLM or MLX. A developer guide and a visual explainer accompany Google's announcement post. For engineers tired of waiting for autoregressive models to finish a sentence on local hardware, this is worth a weekend of hacking.

DiffusionGemma won’t replace Gemini or even standard Gemma 4 for cloud workloads. But for speed-critical, low-concurrency local workflows, it changes what’s possible on a single consumer GPU.


Source: DiffusionGemma: 4x Faster Text Generation
Domain: blog.google
