BASIS: Balanced Activation Sketching with Invariant Scalars for Efficient Backpropagation

A new algorithm reduces the spatial bottleneck in deep neural networks, enabling scaling and improving training stability.

BASIS (Balanced Activation Sketching with Invariant Scalars) is an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions, relieving the spatial bottleneck that limits the scaling of deep neural networks. Theoretically, BASIS reduces activation memory to O(L · RN) and substantially shrinks the matrix-multiplication footprint of the backward pass. Empirically, training a GPT architecture for 50,000 steps validates these guarantees: at R = 32, BASIS reaches parity with exact backpropagation and marginally outperforms it on validation loss (6.575 vs. 6.616), acting as an implicit regularizer. Notably, the stabilized magnitude trajectory lets the model converge smoothly even under extreme spatial compression (R = 1), demonstrating the robustness of the estimator.
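The abstract does not spell out the BASIS estimator itself, but the general idea of decoupling stored-activation memory from the token dimension can be illustrated with a standard randomized sketch of the weight-gradient product. This is a minimal sketch under assumptions of our own: the dimensions `T`, `N`, `R` and the Gaussian sketching matrix `P` are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper):
# T = batch * sequence tokens, N = hidden width, R = sketch size.
T, N, R = 4096, 64, 32

X = rng.standard_normal((T, N))    # layer input activations
dY = rng.standard_normal((T, N))   # upstream gradient

# Exact weight gradient for a linear layer Y = X @ W: dW = X.T @ dY.
# Computing it requires storing X, whose size T x N grows with batch/sequence.
dW_exact = X.T @ dY

# Sketched variant: store only P @ X, an R x N matrix whose size is
# independent of T. With P entries i.i.d. N(0, 1/R), E[P.T @ P] = I,
# so (P @ X).T @ (P @ dY) is an unbiased estimate of X.T @ dY.
P = rng.standard_normal((R, T)) / np.sqrt(R)
SX = P @ X                         # the only activation state kept: R x N
dW_sketch = SX.T @ (P @ dY)

rel_err = np.linalg.norm(dW_sketch - dW_exact) / np.linalg.norm(dW_exact)
```

Per layer the retained state is R × N, which is where an O(L · RN) total over L layers would come from; the estimator's variance, not bias, is what shrinks as R grows.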


Source: BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"
