Orthogonalization Lifts mLSTM Recall from 17% to 58% on Hard Noisy Tasks

From 4 solved seeds out of 24 to 14–16: that’s the jump Ayush Tambde’s orthogonalization trick delivers for mLSTMs on the hardest noisy associative recall tasks. On the MAD benchmark with vocab 96 and sequence lengths of 768 and 1024, raw mLSTMs barely function. Apply five Newton-Schulz iterations during reads, and suddenly they work.

Why Noisy Recall Matters for Recurrent Models

Transformers own associative recall because every token can peek directly at any earlier token. That quadratic cost isn’t a problem when you have unlimited compute, but it kills you in long-horizon RL (think Dreamer-style agents), where sequences stretch into thousands of steps. Recurrent neural networks keep a fixed-size state, so memory stays constant and compute grows only linearly with sequence length, but they’ve never matched attention on recall.

mLSTM is among the strongest RNNs we have for this, maintaining a matrix-valued memory instead of a single hidden-state vector. It crushes the MQAR (multi-query associative recall) benchmark. But MQAR tests pure recall without noise. Real environments have distractors, so Tambde turned to MAD’s noisy associative recall (NAR) suite, where keys and values get interleaved with distractor tokens. That’s where mLSTM starts to choke.
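To make "matrix memory" concrete, here is a minimal sketch of the outer-product write and matrix-vector read that this family of models builds on. It is a simplification, not the full mLSTM cell: the learned gates are replaced by plain scalars, the normalizer state is omitted, and the function names `write` and `read` are hypothetical.

```python
import numpy as np

def write(C, key, value, forget=1.0, inp=1.0):
    """Decay the old memory, then add the outer product value * key^T."""
    return forget * C + inp * np.outer(value, key)

def read(C, query):
    """Retrieve the value stored under `query` with a matrix-vector product."""
    return C @ query

d = 4
keys = np.eye(d)  # orthonormal keys: an idealized, interference-free case
values = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0]])

# Store two key-value pairs in a single d x d matrix.
C = np.zeros((d, d))
for k, v in zip(keys[:2], values):
    C = write(C, k, v)

# Querying with the first key returns the first value.
recalled = read(C, keys[0])
```

With orthonormal keys, recall is exact. Once keys overlap or distractor tokens get written into the same matrix, reads pick up interference, which is the failure mode the NAR tasks probe.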

Borrowing from Muon: Orthogonalization During Reads

Muon, the optimizer that’s been eating Adam’s lunch in language modeling, orthogonalizes its momenta to prevent a few directions from dominating updates. Recent work showed Muon beats Adam specifically on tail-end associative memory learning. Tambde reasoned: why not orthogonalize the memory matrix itself during the mLSTM read operation?

The implementation is minimal: normalize by the Frobenius norm (epsilon 1e-6), run five Newton-Schulz iterations, and let gradients flow through the iterations. Crucially, the orthogonalized version is used only for readouts; writing the orthogonalized memory back degraded performance. All models were trained with AdamW at batch size 64 for 2k steps, with the learning rate swept from 3e-4 to 1e-2.
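The recipe above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not Tambde's code: the article does not say which Newton-Schulz polynomial is used, so the sketch uses the classic cubic update X <- 1.5 X - 0.5 X X^T X (Muon itself uses a tuned higher-order variant), and the names `newton_schulz_orthogonalize` and `read_orthogonalized` are hypothetical. NumPy has no autodiff, so the "gradients flow through" part would need a framework such as JAX or PyTorch, where the same operations are differentiable.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5, eps=1e-6):
    """Push the singular values of M toward 1 without computing an SVD.

    Assumption: classic cubic Newton-Schulz coefficients (1.5, -0.5).
    """
    # Frobenius normalization bounds every singular value by 1,
    # which keeps the iteration inside its convergence region.
    X = M / (np.linalg.norm(M) + eps)  # default norm for 2-D input is Frobenius
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def read_orthogonalized(C, query, steps=5):
    """Read from an orthogonalized copy of the memory; C itself is not modified."""
    return newton_schulz_orthogonalize(C, steps) @ query

# Toy memory with unequal singular values (3 and 4).
C = np.diag([3.0, 4.0])
C_before = C.copy()
q = np.array([1.0, 0.0])

h = read_orthogonalized(C, q)  # close to [1, 0]: both directions now have gain ~1
```

The stored matrix stays untouched, matching the read-only use described above. With five steps, very small singular values are only partially pulled toward 1, so the result is an approximation in general.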

Results That Demand Attention

Orthogonalization improved success rate and mean accuracy across every tested configuration. The gap widens at larger vocab sizes, exactly where raw mLSTMs collapse. At vocab 96 and seq len 768, solved seeds jumped from 4 to 14; at seq len 1024, from 4 to 16. The intervention is small: it costs a few extra FLOPs and some wall-clock time, with zero change in parameter count.
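The headline percentages follow directly from these seed counts. A quick check, assuming 24 seeds per configuration as stated in the opening:

```python
TOTAL_SEEDS = 24  # seeds per configuration, per the article's opening

def success_rate(solved, total=TOTAL_SEEDS):
    """Percentage of seeds that solved the task."""
    return 100.0 * solved / total

rates = {solved: round(success_rate(solved)) for solved in (4, 14, 16)}
print(rates)  # {4: 17, 14: 58, 16: 67}
```

So the 17% to 58% figure in the title corresponds to the seq len 768 result (4 to 14 seeds); at seq len 1024 the same baseline rises to about 67%.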

A note of caution: these experiments use small models and synthetic tasks. The code is fully reproducible, and Tambde is explicit that NAR gains might not transfer directly to real-world benchmarks at scale. But the pattern is striking: a cheap linear-algebra trick lifts a recurrent architecture from broken to functional on precisely the kind of noise-robust recall that long-horizon agents need.

If these gains hold when models grow, orthogonalized mLSTM could give us a recurrent workhorse that finally doesn’t make you choose between memory and compute.

Source: Matrix Orthogonalization Improves Memory in Recurrent Models
Domain: ayushtambde.com
