Standard dot-product attention materializes an $N \times N$ affinity matrix, and that quadratic memory cost becomes a practical wall past roughly 8k tokens. Gaussian Mixture Attention (GMA) gets around the wall by routing tokens through $K$ learned Gaussian components, reducing the cost to $O(NK)$ for fixed $K$.
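For a rough sense of scale, the sketch below counts matrix entries for a dense $N \times N$ affinity matrix against the two $N \times K$ responsibility matrices that GMA stores instead. The function names are hypothetical, and $K = 64$ is an illustrative choice, not a value taken from the paper.

```python
def affinity_entries(n_tokens: int) -> int:
    """Entries in a dense N x N affinity matrix."""
    return n_tokens * n_tokens


def routing_entries(n_tokens: int, n_components: int) -> int:
    """Entries in the two N x K responsibility matrices (queries and keys)."""
    return 2 * n_tokens * n_components


N = 8192  # sequence length, matching the "8k tokens" regime
K = 64    # illustrative component count (assumption, not from the paper)

dense = affinity_entries(N)
routed = routing_entries(N, K)
print(f"dense: {dense:,}  routed: {routed:,}  ratio: {dense // routed}x")
# prints "dense: 67,108,864  routed: 1,048,576  ratio: 64x"
```

The ratio is $N / (2K)$, so with $K$ held fixed the saving grows linearly with sequence length.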
How GMA Works: Routing Through K Gaussian Components
Instead of comparing every query with every key, GMA maps both to posterior responsibility vectors over a shared latent space of $K$ mixture components. The overlap of those responsibility vectors defines token-to-token affinity implicitly. Values are written into a $K$-slot latent memory and read out via the same routing. Because matrix multiplication is associative, the keys' responsibilities can be multiplied with the values first, producing the $K$-slot memory, and the queries' responsibilities applied afterward, so the $N \times N$ matrix is never materialized; the dominant storage is the two $N \times K$ responsibility matrices. The authors formulate both bidirectional and causal variants, with an end-to-end differentiable parameterization of the Gaussian components.
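The write-then-read routing can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the components are assumed isotropic (one variance per component), any output normalization the paper may apply is omitted, and the names `responsibilities` and `gma_bidirectional` are hypothetical.

```python
import numpy as np


def responsibilities(x, means, log_var, log_pi):
    """Posterior p(component | token) under an isotropic Gaussian mixture.

    x: (N, d) tokens, means: (K, d), log_var: (K,), log_pi: (K,). Returns (N, K).
    """
    d = x.shape[1]
    # Squared distance from every token to every component mean: (N, K).
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    # log(pi_k * N(x; mu_k, sigma_k^2 I)), dropping the constant shared by all k.
    log_joint = log_pi - 0.5 * (sq_dist / np.exp(log_var) + d * log_var)
    # Numerically stable softmax over components.
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)


def gma_bidirectional(q, k, v, means, log_var, log_pi):
    """Route values through a K-slot memory without forming an N x N matrix."""
    r_q = responsibilities(q, means, log_var, log_pi)  # (N, K)
    r_k = responsibilities(k, means, log_var, log_pi)  # (N, K)
    memory = r_k.T @ v                                 # write: (K, d_v)
    return r_q @ memory, r_q, r_k                      # read:  (N, d_v)


rng = np.random.default_rng(0)
n_tokens, n_comp, d_model = 32, 4, 8
q, k, v = rng.normal(size=(3, n_tokens, d_model))
means = rng.normal(size=(n_comp, d_model))
log_var = np.zeros(n_comp)                      # unit variance per component
log_pi = np.log(np.full(n_comp, 1.0 / n_comp))  # uniform mixing weights

out, r_q, r_k = gma_bidirectional(q, k, v, means, log_var, log_pi)

# Reference path that does materialize the N x N affinity, for comparison only.
explicit = (r_q @ r_k.T) @ v
print(np.allclose(out, explicit))
```

Both paths compute the same product; only the order of multiplication changes, which is what keeps the largest intermediates at $N \times K$ and $K \times d_v$.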
Causal GMA vs. Optimized Baselines on WikiText-103
Empirical results confirm the linear memory scaling. On long-context classification, GMA is competitive with standard attention-style baselines. Causal GMA beats all tested linear/random-feature attention variants on WikiText-103 perplexity. It does not yet beat optimized causal SDPA (softmax dot-product attention) or Mamba in this implementation. Analysis of learned responsibilities shows broad component usage and moderate alignment with surface-form token categories, suggesting the routing is interpretable without being a hard cluster assignment.
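The summary above does not spell out how the causal variant is computed. One plausible construction, borrowed from the prefix-sum trick common in linear attention and not confirmed by the paper, keeps a running $K$-slot memory so that each token reads only what earlier tokens and itself have written. The function name `gma_causal` is hypothetical, the responsibilities are random stand-ins, and normalization is again omitted.

```python
import numpy as np


def gma_causal(r_q, r_k, v):
    """Causal read-out: token t only sees values written by tokens s <= t."""
    n_tokens, n_comp = r_k.shape
    memory = np.zeros((n_comp, v.shape[1]))  # running K-slot memory
    out = np.empty_like(v)
    for t in range(n_tokens):
        memory += np.outer(r_k[t], v[t])  # write token t into the slots
        out[t] = r_q[t] @ memory          # read with token t's routing
    return out


rng = np.random.default_rng(0)
n_tokens, n_comp, d_value = 16, 4, 8
# Stand-in responsibilities: each row is a distribution over the K components.
r_q = rng.dirichlet(np.ones(n_comp), size=n_tokens)
r_k = rng.dirichlet(np.ones(n_comp), size=n_tokens)
v = rng.normal(size=(n_tokens, d_value))

causal_out = gma_causal(r_q, r_k, v)

# Reference: explicit N x N affinity with a lower-triangular (causal) mask.
masked = np.tril(r_q @ r_k.T) @ v
print(np.allclose(causal_out, masked))
```

The loop touches a $K \times d_v$ state once per token, so time and memory stay linear in sequence length; a practical implementation would likely replace the Python loop with a chunked or scan-based kernel.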
Why This Matters for Long-Context Transformers
GMA is not a universal replacement for softmax attention or state-space models, and the authors are explicit about that. But it offers a principled, probabilistic alternative that keeps the attention-style formulation while avoiding the quadratic memory cost. The $K$-slot latent memory gives a concrete knob for trading expressiveness against compute. Future work could narrow causal GMA's performance gap with Mamba and SDPA, or extend the routing to hierarchical mixtures. For now, GMA shows that linear-time attention doesn't have to abandon the query-key-value paradigm.
Source: Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing
Domain: arxiv.org