R2LM's Asymmetric Context Hits Up to 12.9× Higher Throughput Than Bidirectional Diffusion

R2LM reports 2.4× to 12.9× higher throughput than bidirectional discrete diffusion LMs in batch serving. That's the headline number, and the design behind it targets an architectural split that's been holding diffusion language models back from production deployment.

If you've worked with discrete diffusion LMs, you know the frustration: bidirectional attention delivers strong generation quality by letting each token see the full unmasked context, but it's incompatible with KV caching, so batch inference throughput tanks. Causal attention supports efficient cached inference, but it can't see any tokens to the right, and quality suffers substantially. Pick your poison.

The Bidirectional vs. Causal Tradeoff That Was Killing Diffusion LMs

The core problem is structural: discrete diffusion LMs (dLLMs) recover masked tokens in parallel, which in theory offers large speedups over autoregressive generation. In practice, the choice between bidirectional and causal attention forces a hard quality-vs-speed tradeoff. Bidirectional models lose KV caching, because every position's representation can change whenever any other token changes, and that makes them slow under batch-serving workloads. Causal models give up right-side context, which degrades generation quality, especially on tasks where a token depends on what comes after it.
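The cacheability argument can be checked on a toy example. The following is a minimal single-head attention sketch in NumPy, not code from the paper; the function name `attention`, the dimensions, and the random weights are all assumptions made for illustration. It fills in the last token of a sequence (as a diffusion step might when unmasking) and checks whether the outputs at earlier positions stay the same, which is the property a KV cache relies on once layers are stacked.

```python
import numpy as np

def attention(x, wq, wk, wv, causal):
    """Toy single-head attention over a (seq_len, dim) input."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal:
        # Hide every position to the right of the query position.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, dim = 6, 8  # illustrative sizes, not from the paper
x = rng.normal(size=(seq_len, dim))
wq, wk, wv = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Simulate one token on the right being replaced (e.g. unmasked).
x_changed = x.copy()
x_changed[-1] = rng.normal(size=dim)

def prefix_is_stable(causal):
    """True if outputs at all earlier positions ignore the changed last token."""
    before = attention(x, wq, wk, wv, causal)
    after = attention(x_changed, wq, wk, wv, causal)
    return np.allclose(before[:-1], after[:-1])

causal_stable = prefix_is_stable(causal=True)
bidirectional_stable = prefix_is_stable(causal=False)
print(causal_stable, bidirectional_stable)
```

Under the causal mask the earlier outputs are untouched, so anything computed from them in deeper layers can be cached and reused. Without the mask they shift whenever a token to the right changes, so cached values would go stale.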

No previous approach escaped this dilemma cleanly. The paper's authors describe it as a "fundamental architectural design dilemma" and set out to break it with asymmetric bidirectional context.

R2LM's Asymmetric Sidecar: Causal Attention + Reverse Mamba

R2LM — Right-to-Left Mamba — is the concrete instantiation of the Bifocal dLLM paradigm. It uses two complementary mechanisms: standard causal attention provides precise left-context with full KV cache compatibility, while a lightweight reverse Mamba SSM sidecar supplies compressed right-side context without breaking cacheability.

Think of it as bifocal lenses: one optical path handles the near field (left context) with high precision; another handles the far field (right context) with a compressed representation. The reverse Mamba operates only on the right-side tokens, compressing them into a small state that feeds into the causal attention stream without adding quadratic compute. The result is a model that keeps KV caching intact while gaining access to right-side information.
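To make "compressing right-side tokens into a small state" concrete, here is a rough sketch of a right-to-left linear recurrence. It is an assumption-laden stand-in, not R2LM's implementation: real Mamba layers use input-dependent (selective) parameters, while this uses a fixed decay, and `reverse_scan`, `decay`, `w_in`, and all sizes are hypothetical names and values. How the paper fuses this summary into the causal stream is not described in this article, so the sketch stops at producing the per-position summary.

```python
import numpy as np

def reverse_scan(x, decay, w_in):
    """Right-to-left recurrence: states[t] summarizes tokens t+1 .. end.

    x:     (seq_len, dim) token representations
    decay: (state_dim,) per-channel decay in (0, 1), fixed here for simplicity
    w_in:  (dim, state_dim) input projection
    """
    seq_len = x.shape[0]
    state_dim = w_in.shape[1]
    states = np.zeros((seq_len, state_dim))
    h = np.zeros(state_dim)
    for t in range(seq_len - 1, -1, -1):
        states[t] = h                # only tokens strictly to the right of t
        h = decay * h + x[t] @ w_in  # fold token t in for positions further left
    return states

rng = np.random.default_rng(0)
seq_len, dim, state_dim = 8, 16, 4  # illustrative sizes
x = rng.normal(size=(seq_len, dim))
w_in = rng.normal(size=(dim, state_dim)) / np.sqrt(dim)
decay = np.full(state_dim, 0.9)

states = reverse_scan(x, decay, w_in)
print(states.shape)  # one fixed-size summary per position
```

The scan costs one pass over the sequence and the summary stays at `state_dim` values per position regardless of how many tokens lie to the right, which is the sense in which a sidecar like this avoids quadratic compute.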

Real Numbers: 60B Tokens on Qwen3-1.7B Prove the Point

The team continued pretraining a Qwen3-1.7B model on 60 billion tokens to evaluate R2LM against both causal and bidirectional baselines. In batch-serving scenarios, R2LM achieves 2.4× to 12.9× higher throughput than bidirectional dLLMs and 1.9× to 2.9× speedup over autoregressive baselines — all while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average.

Those throughput gains come directly from combining parallel decoding with KV caching, which bidirectional models cannot use. The Mamba sidecar is described as lightweight: it adds no quadratic compute and leaves the cache pipeline intact.

R2LM makes discrete diffusion LMs finally practical for high-throughput batch serving without sacrificing generation quality. Expect this asymmetric context pattern to show up in next-gen inference engines.

Source: Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation
Domain: arxiv.org
