New MoA Proof Cuts Attention's Memory Footprint from Quadratic to Linear

The Mathematics of Arrays (MoA) reformulation eliminates every intermediate array in scaled dot-product attention, reducing data movement to O(n·dk + n·dv) and projecting up to 100× speedup on exascale hardware.

mathematics of arrays, attention mechanism, transformers, memory optimal, darpa, doe

The standard attention mechanism moves O(n² + n·dk + n·dv) data per forward pass — a quadratic bottleneck that makes long-context transformers expensive on memory bandwidth. A new MoA (Mathematics of Arrays) reformulation kills the quadratic term entirely by algebraic construction, not empirical tiling.

DRAM accesses cost 100–1000× more energy than arithmetic on modern hardware, so FLOP-based analyses miss the real bottleneck. This paper derives a Denotational Normal Form (DNF) for scaled dot-product attention that eliminates every intermediate array — the implicit transposed-key buffer, every softmax temporary — before any code is written. The result is a provably memory-minimal kernel: O(n·dk + n·dv) data movement versus the textbook O(n² + n·dk + n·dv).
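To put rough numbers on the two bounds, the sketch below counts array elements for one illustrative configuration. The function names and the sizes n = 4096, dk = dv = 64 are assumptions chosen for illustration, and the counts keep only the terms named in the bounds, with constant factors dropped.

```python
def standard_traffic(n: int, dk: int, dv: int) -> int:
    """Elements moved under the textbook bound O(n^2 + n*dk + n*dv)."""
    return n * n + n * dk + n * dv


def fused_traffic(n: int, dk: int, dv: int) -> int:
    """Elements moved under the fused bound O(n*dk + n*dv)."""
    return n * dk + n * dv


# Illustrative sizes (assumed, not taken from the paper).
n, dk, dv = 4096, 64, 64
standard = standard_traffic(n, dk, dv)
fused = fused_traffic(n, dk, dv)
print(standard, fused, standard / fused)
```

For these sizes the n² term accounts for 16,777,216 of the 17,301,504 elements, so dropping it shrinks the count by a factor of 33.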

Algebraic Construction, Not Empirical Tiling

FlashAttention and its variants rely on empirical tiling to reduce memory traffic. MoA goes deeper: it fuses array operations using shape-transformation laws from the Mathematics of Arrays framework, guaranteeing correctness without hand-tuning. The derivation proceeds from a Python specification through Operational Normal Form (ONF) to dimension-lifted hardware mapping. No compile-time search, no heuristics — the memory savings are a theorem.
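As a rough illustration of what fusing away the intermediates can look like, the sketch below computes attention one query at a time with a running-max accumulation (the well-known online softmax technique). This is not the paper's DNF or ONF derivation, which this article does not reproduce. The function names and inputs are hypothetical, and pure-Python lists stand in for arrays.

```python
import math


def attention_naive(Q, K, V):
    """Textbook attention that materializes the full n x n score matrix."""
    scale = 1.0 / math.sqrt(len(K[0]))
    scores = [[scale * sum(q * k for q, k in zip(qi, kj)) for kj in K] for qi in Q]
    out = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        out.append([sum(e * vj[c] for e, vj in zip(exps, V)) / total
                    for c in range(len(V[0]))])
    return out


def attention_streaming(Q, K, V):
    """Same result with O(dv) scratch per query: no score matrix, no softmax row."""
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for qi in Q:
        running_max = -math.inf
        denom = 0.0
        acc = [0.0] * len(V[0])
        for kj, vj in zip(K, V):
            s = scale * sum(q * k for q, k in zip(qi, kj))
            new_max = max(running_max, s)
            # Rescale earlier contributions to the new max; exp(-inf) is 0.0 on the first key.
            correction = math.exp(running_max - new_max)
            weight = math.exp(s - new_max)
            denom = denom * correction + weight
            acc = [a * correction + weight * v for a, v in zip(acc, vj)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


# Tiny illustrative inputs (assumed): n = 3, dk = 2, dv = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
naive = attention_naive(Q, K, V)
streamed = attention_streaming(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(naive, streamed) for a, b in zip(ra, rb))
print(max_diff)
```

Because the streaming loop sums in a different order, this sketch agrees with the naive version to within rounding error rather than necessarily bit for bit.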

The paper verifies the DNF numerically against PyTorch in double-precision floating point on concrete inputs. The derived form matches PyTorch bit for bit on the outputs (the softmax and the final attention values), confirming that the transformation preserves the computation while eliminating buffers.
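A bit-exact check is stricter than a tolerance check, and the difference is easy to demonstrate. The helper below is a hypothetical checker, not the paper's verification harness: it compares two equally shaped nested lists of Python floats (IEEE 754 binary64) and reports both whether every value matches bit for bit and the largest absolute difference.

```python
import struct


def bits(x: float) -> bytes:
    """Raw IEEE 754 binary64 encoding of a Python float."""
    return struct.pack("<d", x)


def compare(reference, candidate):
    """Return (bit_exact, max_abs_diff) for two equally shaped 2-D lists."""
    bit_exact = True
    max_abs_diff = 0.0
    for ref_row, cand_row in zip(reference, candidate):
        for r, c in zip(ref_row, cand_row):
            if bits(r) != bits(c):
                bit_exact = False
            max_abs_diff = max(max_abs_diff, abs(r - c))
    return bit_exact, max_abs_diff


# 0.1 + 0.2 and 0.3 differ in the last bit, so they pass a tolerance
# check but fail a bit-exact one.
a = [[0.1 + 0.2, 1.0]]
b = [[0.3, 1.0]]
print(compare(a, a))
print(compare(a, b))
```

With PyTorch tensors, `torch.equal` would play the role of the exact comparison. Note that comparing encodings treats 0.0 and -0.0 as different values.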

Projected Speedups and Energy Gains at Exascale

A predictive performance model projects 2–100× speedup and 2–50× energy reduction over standard implementations. The advantage widens at scale: exascale systems where DRAM bandwidth is the dominant constraint see the largest gains. The authors explicitly cite relevance to DARPA edge-deployment and DOE exascale priorities — this isn't just a theoretical curiosity.
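One way to see why the advantage would widen with scale is to take the ratio of the two data-movement bounds, which simplifies to 1 + n / (dk + dv) and so grows linearly with sequence length. The sketch below is not the paper's predictive model, which this article does not detail. It ignores caches, compute, and constant factors, and the head dimensions and sequence lengths are assumed values.

```python
def traffic_ratio(n: int, dk: int, dv: int) -> float:
    """(n^2 + n*dk + n*dv) / (n*dk + n*dv), simplified to 1 + n / (dk + dv)."""
    return 1.0 + n / (dk + dv)


dk = dv = 64  # assumed head dimensions
for n in (128, 1024, 8192, 65536):
    print(n, traffic_ratio(n, dk, dv))
```

These figures are ratios of elements moved, not speedups. How much of that ratio turns into runtime or energy savings depends on how bandwidth-bound the hardware is.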

Because the framework is performance-portable (a single algebraic description maps to different hardware), the same kernel can target GPU, CPU, and near-memory compute without being rewritten. That is a concrete advantage over hardware-specific accelerators.

What This Enables Next

If the performance model holds at scale, the biggest energy bottleneck in transformer inference just got a formal lower bound — and a practical path to hitting it.


Source: Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels
Domain: arxiv.org
