
Gaussian Mixture Attention Replaces Pairwise Comparisons with K Latent Slots

A new attention variant achieves $O(NK)$ memory and time scaling by routing queries and keys through $K$ learned Gaussian components, matching baseline accuracy on long-context classification while avoiding the $N \times N$ matrix.

gaussian mixture attention, transformers, linear attention, long context models, probabilistic routing, machine learning

Standard dot-product attention's $N \times N$ affinity matrix becomes a brick wall past 8k tokens. Gaussian Mixture Attention (GMA) smashes that wall by routing tokens through $K$ learned Gaussian components, collapsing the quadratic cost to $O(NK)$ with fixed $K$.

How GMA Works: Routing Through K Gaussian Components

Instead of comparing every query with every key, GMA maps both to posterior responsibility vectors over a shared latent space of $K$ mixture components. The overlap of those responsibility vectors defines token-to-token affinity implicitly. Values are written into a $K$-slot latent memory and read out via the same routing. Matrix multiplication's associativity means the $N \times N$ matrix never gets materialized; the dominant storage is the two $N \times K$ responsibility matrices. The authors formulate both bidirectional and causal variants with an end-to-end differentiable parameterization of the Gaussian components.
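The write-then-read routing described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the isotropic unit-variance Gaussians, uniform mixing weights, row normalization, and every name (`responsibilities`, `gma_linear`, `gma_explicit`, the toy inputs) are assumptions made for illustration. The point is only that summing over the $K$ slots first gives the same output as building the $N \times N$ affinity matrix explicitly.

```python
import math

def responsibilities(x, means, var=1.0):
    """Posterior over K isotropic Gaussians (assumed uniform mixing weights)."""
    logits = [-sum((a - b) ** 2 for a, b in zip(x, mu)) / (2.0 * var) for mu in means]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gma_linear(queries, keys, values, means):
    """Route through K slots: O(NK) work, no N x N matrix is built."""
    n_slots, dim = len(means), len(values[0])
    r_q = [responsibilities(q, means) for q in queries]  # N x K
    r_k = [responsibilities(k, means) for k in keys]     # N x K
    # Write step: each slot accumulates responsibility-weighted values (K x dim).
    memory = [[sum(r[s] * v[c] for r, v in zip(r_k, values)) for c in range(dim)]
              for s in range(n_slots)]
    # Total key responsibility per slot, used for the assumed row normalization.
    mass = [sum(r[s] for r in r_k) for s in range(n_slots)]
    # Read step: each query mixes the slots by its own responsibilities.
    out = []
    for r in r_q:
        norm = sum(r[s] * mass[s] for s in range(n_slots))
        out.append([sum(r[s] * memory[s][c] for s in range(n_slots)) / norm
                    for c in range(dim)])
    return out

def gma_explicit(queries, keys, values, means):
    """Reference O(N^2) version that materializes the affinity matrix."""
    dim = len(values[0])
    r_q = [responsibilities(q, means) for q in queries]
    r_k = [responsibilities(k, means) for k in keys]
    out = []
    for rq in r_q:
        # Affinity of this query to every key: overlap of responsibility vectors.
        affinity = [sum(a * b for a, b in zip(rq, rk)) for rk in r_k]
        norm = sum(affinity)
        out.append([sum(w * v[c] for w, v in zip(affinity, values)) / norm
                    for c in range(dim)])
    return out

# Toy inputs (hypothetical): 4 tokens, 2-dimensional features, K = 3 components.
means = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
queries = [[0.1, 0.2], [0.9, 1.1], [-0.8, 0.7], [0.4, -0.3]]
keys = [[0.0, 0.1], [1.2, 0.8], [-1.1, 1.0], [0.3, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]

fast = gma_linear(queries, keys, values, means)
slow = gma_explicit(queries, keys, values, means)
```

Here `fast` and `slow` agree up to floating-point error, which is the associativity argument in action: $(R_q R_k^\top) V$ and $R_q (R_k^\top V)$ are the same product grouped differently.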

Causal GMA vs. Optimized Baselines on WikiText-103

Empirical results confirm the linear memory scaling. On long-context classification, GMA is competitive with standard attention-style baselines. Causal GMA beats all tested linear/random-feature attention variants on WikiText-103 perplexity. It does not yet beat optimized causal SDPA (softmax dot-product attention) or Mamba in this implementation. Analysis of learned responsibilities shows broad component usage and moderate alignment with surface-form token categories, suggesting the routing is interpretable without being a hard cluster assignment.
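One plausible way to picture the causal variant is a running $K$-slot memory that each token first writes to and then reads from. The summary does not describe the paper's actual causal mechanism, so the sketch below is an assumption borrowed from the usual prefix-sum trick in linear attention; the names `responsibilities` and `gma_causal`, the unit-variance Gaussians, and the normalization are all hypothetical.

```python
import math

def responsibilities(x, means, var=1.0):
    """Posterior over K isotropic Gaussians (assumed uniform mixing weights)."""
    logits = [-sum((a - b) ** 2 for a, b in zip(x, mu)) / (2.0 * var) for mu in means]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gma_causal(queries, keys, values, means):
    """Token t reads a K-slot memory that holds only tokens 0..t."""
    n_slots, dim = len(means), len(values[0])
    memory = [[0.0] * dim for _ in range(n_slots)]  # running K x dim state
    mass = [0.0] * n_slots                          # running responsibility mass
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Write the current token into the slots before reading,
        # so each position can attend to itself.
        r_k = responsibilities(k, means)
        for s in range(n_slots):
            mass[s] += r_k[s]
            for c in range(dim):
                memory[s][c] += r_k[s] * v[c]
        # Read with the current query's responsibilities.
        r_q = responsibilities(q, means)
        norm = sum(r_q[s] * mass[s] for s in range(n_slots))
        outputs.append([sum(r_q[s] * memory[s][c] for s in range(n_slots)) / norm
                        for c in range(dim)])
    return outputs

# Toy inputs (hypothetical): 4 tokens, 2-dimensional features, K = 3 components.
means = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
queries = [[0.1, 0.2], [0.9, 1.1], [-0.8, 0.7], [0.4, -0.3]]
keys = [[0.0, 0.1], [1.2, 0.8], [-1.1, 1.0], [0.3, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]

causal_out = gma_causal(queries, keys, values, means)
```

In this sketch the per-step state is $K \times d$ regardless of sequence position, and the first output equals the first value vector because only one token has been written when it is read.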

Why This Matters for Long-Context Transformers

GMA is not a universal replacement for softmax attention or state-space models - the authors are explicit about that. But it offers a principled, probabilistic alternative that keeps the attention-style formulation while breaking the quadratic memory wall. The $K$-slot latent memory gives a concrete knob to trade off expressiveness against compute. Future work could narrow causal GMA's performance gap with Mamba and SDPA, or extend the routing to hierarchical mixtures. For now, GMA shows that linear-time attention doesn't have to abandon the query-key-value paradigm.
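To make the $K$ knob concrete, the back-of-the-envelope sketch below counts stored entries for one attention head: the full $N \times N$ affinity matrix versus the two $N \times K$ responsibility matrices. The function names and the choice $K = 64$ are illustrative assumptions, not figures from the paper, and the count ignores batch size, value vectors, and the $K$-slot memory itself.

```python
def affinity_matrix_entries(n_tokens):
    """Entries in the explicit N x N affinity matrix."""
    return n_tokens * n_tokens

def gma_routing_entries(n_tokens, n_components):
    """Entries in the two N x K responsibility matrices (queries and keys)."""
    return 2 * n_tokens * n_components

def savings_factor(n_tokens, n_components):
    """How many times smaller the routing storage is; equals N / (2K)."""
    return affinity_matrix_entries(n_tokens) / gma_routing_entries(n_tokens, n_components)

# Hypothetical setting: 8,192 tokens routed through K = 64 components.
full = affinity_matrix_entries(8192)
routed = gma_routing_entries(8192, 64)
factor = savings_factor(8192, 64)
```

With these numbers the affinity matrix holds 67,108,864 entries against 1,048,576 for the routing matrices, a 64-fold reduction. Because the factor is $N / (2K)$, doubling $K$ halves the saving, while doubling the sequence length doubles it.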


Source: Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing
Domain: arxiv.org
