Source linked

إثبات MoA الجديد يقلل إشارة الذاكرة من الربعية إلى الخطية

يتم تقسيم Mathematics of Arrays (MoA) من خلال إزالة جميع المجموعات المتوسطة في التركيز على المنتج النموذجي، مما يقلل من حركة البيانات إلى O(n·dk + n·dv) وخلق تحسينات تصل إلى 100 مرة على الأجهزة النموذجية.

mathematics of arraysattention mechanismtransformersmemory optimaldarpadoe

The standard attention mechanism moves O(n² + n·dk + n·dv) data per forward pass — a quadratic bottleneck that makes long-context transformers expensive on memory bandwidth. A new MoA (Mathematics of Arrays) reformulation kills the quadratic term entirely by algebraic construction, not empirical tiling.

DRAM accesses cost 100–1000× more energy than arithmetic on modern hardware, so FLOP-based analyses miss the real bottleneck. This paper derives a Denotational Normal Form (DNF) for scaled dot-product attention that eliminates every intermediate array — the implicit transposed-key buffer, every softmax temporary — before any code is written. The result is a provably memory-minimal kernel: O(n·dk + n·dv) data movement versus the textbook O(n² + n·dk + n·dv).

Algebraic Construction, Not Empirical Tiling

FlashAttention and its variants rely on empirical tiling to reduce memory traffic. MoA goes deeper: it fuses array operations using shape-transformation laws from the Mathematics of Arrays framework, guaranteeing correctness without hand-tuning. The derivation proceeds from a Python specification through Operational Normal Form (ONF) to dimension-lifted hardware mapping. No compile-time search, no heuristics — the memory savings are a theorem.

The paper verifies the DNF numerically against PyTorch at full double-precision floating-point on concrete inputs. The algebra matches bit-exact on the outputs (softmax and final attention values), confirming that the transformation preserves the computation while eliminating buffers.

Projected Speedups and Energy Gains at Exascale

A predictive performance model projects 2–100× speedup and 2–50× energy reduction over standard implementations. The advantage widens at scale: exascale systems where DRAM bandwidth is the dominant constraint see the largest gains. The authors explicitly cite relevance to DARPA edge-deployment and DOE exascale priorities — this isn't just a theoretical curiosity.

Because the framework is performance-portable (single algebraic description maps to different hardware), the same kernel can target GPU, CPU, and near-memory compute without rewriting. That's a concrete advantage over hardware-specific accelerators.

What This Enables Next

If the performance model holds at scale, the biggest energy bottleneck in transformer inference just got a formal lower bound — and a practical path to hitting it.


Source: Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.