Gaussian Mixture Attention Replaces Pairwise Comparisons with K Latent Slots

A new attention variant achieves O(NK) memory and time scaling by routing queries and keys through K learned Gaussian components, matching baseline accuracy on long-context classification while avoiding the NxN...

gaussian mixture attention, transformers, linear attention, long context models, probabilistic routing, machine learning

Standard dot-product attention builds an $N \times N$ affinity matrix, so its memory and compute grow quadratically with sequence length and become a brick wall past 8k tokens. Gaussian Mixture Attention (GMA) sidesteps that wall by routing tokens through $K$ learned Gaussian components, which cuts the cost to $O(NK)$, linear in $N$ for fixed $K$.
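To make the scaling concrete, here is a back-of-the-envelope sketch comparing one dense $N \times N$ affinity matrix against $O(NK)$ storage, counted here as two $N \times K$ matrices. The values $K = 64$ and 4-byte floats are illustrative assumptions, not figures from the paper, and `attention_bytes` is a hypothetical helper.

```python
def attention_bytes(n_tokens, k_slots, bytes_per_element=4):
    """Return (dense, routed) storage in bytes for a single attention head.

    dense  : one N x N affinity matrix
    routed : two N x K matrices (one for queries, one for keys)
    """
    dense = n_tokens * n_tokens * bytes_per_element
    routed = 2 * n_tokens * k_slots * bytes_per_element
    return dense, routed


# Assumed setting: N = 8192 tokens, K = 64 components, float32.
dense, routed = attention_bytes(8192, 64)
print(dense // 2**20, "MiB vs", routed // 2**20, "MiB")  # 256 MiB vs 4 MiB
```

Doubling $N$ quadruples the dense figure but only doubles the routed one, which is the whole point of keeping $K$ fixed.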

How GMA Works: Routing Through K Gaussian Components

Instead of comparing every query with every key, GMA maps both to posterior responsibility vectors over a shared latent space of $K$ mixture components. The overlap of those responsibility vectors defines token-to-token affinity implicitly. Values are written into a $K$-slot latent memory and read out via the same routing. If the query and key responsibilities are stacked into $N \times K$ matrices $R_q$ and $R_k$, the implicit affinity is $R_q R_k^\top$, and associativity lets $(R_q R_k^\top) V$ be computed as $R_q (R_k^\top V)$. The $N \times N$ matrix therefore never gets materialized; the dominant storage is the two $N \times K$ responsibility matrices. The authors formulate both bidirectional and causal variants with an end-to-end differentiable parameterization of the Gaussian components.
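The routing described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the components are isotropic Gaussians, the output is normalized by the total affinity mass per query, and the names `responsibilities`, `gma_bidirectional`, and `gma_causal` are hypothetical. The paper's actual parameterization and normalization may differ.

```python
import numpy as np


def responsibilities(x, means, log_var, log_pi):
    """Posterior p(component k | x) under isotropic Gaussians.

    x: (N, d), means: (K, d), log_var: (K,), log_pi: (K,). Returns (N, K).
    """
    d = x.shape[1]
    # Squared distance from every token to every component mean: (N, K).
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_joint = log_pi - 0.5 * (sq_dist / np.exp(log_var) + d * log_var)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    r = np.exp(log_joint)
    return r / r.sum(axis=1, keepdims=True)


def gma_bidirectional(rq, rk, v):
    """Compute (rq @ rk.T) @ v, row-normalized, without forming the N x N matrix."""
    memory = rk.T @ v        # write values into K slots: (K, d_v)
    mass = rk.sum(axis=0)    # total key responsibility per slot: (K,)
    return (rq @ memory) / (rq @ mass)[:, None]


def gma_causal(rq, rk, v):
    """Causal variant: token i reads only what tokens 0..i have written."""
    n, k = rk.shape
    memory = np.zeros((k, v.shape[1]))
    mass = np.zeros(k)
    out = np.empty((n, v.shape[1]))
    for i in range(n):
        memory += np.outer(rk[i], v[i])  # running K-slot memory
        mass += rk[i]
        out[i] = (rq[i] @ memory) / (rq[i] @ mass)
    return out


# Usage with random inputs (all shapes here are illustrative).
rng = np.random.default_rng(0)
n, k, d, d_v = 16, 4, 8, 8
q, key, v = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))
means, log_var, log_pi = rng.normal(size=(k, d)), np.zeros(k), np.full(k, -np.log(k))
rq = responsibilities(q, means, log_var, log_pi)
rk = responsibilities(key, means, log_var, log_pi)
print(gma_bidirectional(rq, rk, v).shape, gma_causal(rq, rk, v).shape)
```

The Python loop in `gma_causal` keeps only a $K \times d_v$ running state, which mirrors the fixed-size latent memory; a practical implementation would likely replace it with a chunked or parallel scan.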

Causal GMA vs. Optimized Baselines on WikiText-103

Empirical results confirm the linear memory scaling. On long-context classification, GMA is competitive with standard attention-style baselines. On WikiText-103, causal GMA achieves better perplexity than all tested linear and random-feature attention variants. It does not yet beat optimized causal SDPA (scaled dot-product attention) or Mamba in this implementation. Analysis of learned responsibilities shows broad component usage and moderate alignment with surface-form token categories, suggesting the routing is interpretable without being a hard cluster assignment.
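One simple way to quantify "broad component usage" is the normalized entropy of the average responsibility per component. The sketch below is an assumed diagnostic, not necessarily the analysis the authors ran, and `component_usage` is a hypothetical helper. It expects $K > 1$.

```python
import numpy as np


def component_usage(resp):
    """Summarize how evenly tokens spread over the K components.

    resp: (N, K) responsibility matrix whose rows sum to 1.
    Returns (usage, normalized_entropy): usage is the mean responsibility
    per component; normalized_entropy is 1.0 for perfectly even usage and
    0.0 when every token lands on a single component.
    """
    usage = resp.mean(axis=0)
    nonzero = usage[usage > 0]  # skip unused components to avoid log(0)
    entropy = -(nonzero * np.log(nonzero)).sum()
    return usage, entropy / np.log(resp.shape[1])


# Soft, evenly spread routing versus a fully collapsed one.
even = np.full((6, 3), 1.0 / 3.0)
collapsed = np.tile(np.array([1.0, 0.0, 0.0]), (6, 1))
print(component_usage(even)[1], component_usage(collapsed)[1])
```

A value near 1 together with non-one-hot rows would match the article's description of routing that is broadly used without being a hard cluster assignment.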

Why This Matters for Long-Context Transformers

GMA is not a universal replacement for softmax attention or state-space models, and the authors are explicit about that. But it offers a principled, probabilistic alternative that keeps the attention-style formulation while avoiding the quadratic memory cost. The $K$-slot latent memory gives a concrete knob for trading expressiveness against compute: a larger $K$ gives the routing more capacity at proportionally higher cost. Future work could narrow causal GMA's performance gap with Mamba and SDPA, or extend the routing to hierarchical mixtures. For now, GMA shows that linear-time attention doesn't have to abandon the query-key-value paradigm.


Source: Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing
Domain: arxiv.org
