Frustrated Synchronization Network Beats Transformer on Language Modeling at 1M Parameters

At 1 million parameters, a Frustrated Synchronization Network (FSN) trained on enwik8 for 50 epochs converged to a loss of 1.5953, below the 1.611 reached by a tuned RoPE-SwiGLU transformer over the same 50 epochs. That's not a fluke: every 30-epoch FSN seed finished below the transformer's converged 50-epoch loss, and the completed 50-epoch FSN runs landed in a tight spread of 1.5953 ± 0.0014.
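To put that gap in proportion, the short calculation below compares the difference between the two reported losses with the reported spread. It assumes the ± figure measures seed-to-seed variation (something like a standard deviation), which this summary does not state; the variable names are illustrative only.

```python
# Figures quoted in the paragraph above.
fsn_loss = 1.5953          # FSN, completed 50-epoch runs
fsn_spread = 0.0014        # reported +/- spread (ASSUMED to measure seed variation)
transformer_loss = 1.611   # tuned RoPE-SwiGLU transformer, converged at 50 epochs

gap = transformer_loss - fsn_loss          # absolute difference in loss
gap_in_spreads = gap / fsn_spread          # how many "spreads" wide the gap is
relative_gap = gap / transformer_loss      # gap as a fraction of the baseline loss

print(f"absolute gap: {gap:.4f}")
print(f"gap / spread: {gap_in_spreads:.1f}")
print(f"relative gap: {relative_gap:.2%}")
```

Under that assumption, the gap of about 0.0157 is roughly 11 times the reported spread, though it is just under 1% of the transformer's loss in relative terms.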

Phase Coupling as a Replacement for Softmax Attention

The paper introduces the Frustrated Synchronization Network, where token states are phases on a torus. Instead of query-key dot products, the entire value pathway is a single learned complex coupling kernel over harmonics and a one-step delay. Each component of the kernel is a frustration in the synchronization sense: static Kuramoto-Sakaguchi frustration angles, repulsive Daido components, and a delay term that couples each token to the successors of the tokens it attends to. That delay term is algebraically identical to Kuramoto-Sakaguchi coupling whose frustration angle is the data's own transition, so next-token prediction is literally synchronization frustrated by the data.
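The claimed identity between the delay term and Kuramoto-Sakaguchi coupling can be checked numerically. The sketch below is not the paper's code: it assumes scalar phases, the common sign convention sin(theta_j - theta_i - alpha) for the Kuramoto-Sakaguchi term, and a transition angle defined as alpha_j = theta_j - theta_{j+1}. The function names are hypothetical.

```python
import math
import random

def ks_coupling(theta_i, theta_j, alpha):
    """Kuramoto-Sakaguchi pairwise term with frustration angle alpha
    (sign convention assumed: sin(theta_j - theta_i - alpha))."""
    return math.sin(theta_j - theta_i - alpha)

def delayed_coupling(theta_i, theta_next_j):
    """One-step-delay term: token i couples to the successor of token j."""
    return math.sin(theta_next_j - theta_i)

random.seed(0)
# Toy sequence of token phases on the circle (illustrative data only).
phases = [random.uniform(-math.pi, math.pi) for _ in range(8)]

i = 5
max_gap = 0.0
for j in range(i):  # j + 1 <= i, so the successor is never in the future
    # The frustration angle is the data's own transition from j to j + 1.
    alpha_j = phases[j] - phases[j + 1]
    delayed = delayed_coupling(phases[i], phases[j + 1])
    frustrated = ks_coupling(phases[i], phases[j], alpha_j)
    max_gap = max(max_gap, abs(delayed - frustrated))

print(f"largest difference between the two forms: {max_gap:.2e}")
```

The two forms agree to floating-point precision because theta_j - theta_i - (theta_j - theta_{j+1}) simplifies to theta_{j+1} - theta_i. The opposite sign convention for alpha works equally well with the transition angle negated.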

Concrete Wins on Text and Code

On character-level text and code at matched 1M-parameter budgets, the FSN's validation loss sits below the transformer's at every measured epoch. The advantage holds even when the baseline is trained to full convergence. On natural text, the unfrustrated base layer trails the converged transformer at every copy depth, but adding the kernel reverses the deficit at depths of four tokens and beyond, so longer-range copy events become a strength of the FSN.

No MLP, No Problem

A variant replaces every feed-forward block with mean-field coupling to learned collective modes, leaving no multilayer perceptron in the stack. Even so, it tracks the transformer. That means the model is learning entirely through phase interactions and frustration, not through separate nonlinear transformations. The scale ladder runs through 4 million parameters with the advantage persisting; larger scales are marked as in progress.
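For readers unfamiliar with mean-field coupling, the sketch below shows the textbook Kuramoto version: the average pairwise pull on one oscillator equals a pull toward a single collective mode, r * sin(psi - theta_i), where r * exp(i * psi) is the mean of exp(i * theta_j). This is only an illustration of the general idea. The paper's collective modes are learned, and their exact form is not given in this summary; here a single mode is computed directly from the phases, and all names are hypothetical.

```python
import cmath
import math
import random

def order_parameter(phases):
    """Return (r, psi) such that r * exp(i * psi) is the mean of exp(i * theta_j)."""
    z = sum(cmath.exp(1j * t) for t in phases) / len(phases)
    return abs(z), cmath.phase(z)

def pairwise_drive(theta_i, phases):
    """Average of sin(theta_j - theta_i) over all oscillators j."""
    return sum(math.sin(t - theta_i) for t in phases) / len(phases)

def mean_field_drive(theta_i, r, psi):
    """The same drive expressed as coupling to one collective mode."""
    return r * math.sin(psi - theta_i)

random.seed(1)
# Toy phases (illustrative data only).
phases = [random.uniform(-math.pi, math.pi) for _ in range(16)]
r, psi = order_parameter(phases)

max_gap = max(
    abs(pairwise_drive(t, phases) - mean_field_drive(t, r, psi))
    for t in phases
)
print(f"r = {r:.3f}, largest difference = {max_gap:.2e}")
```

The point of the mean-field form is cost: each token interacts with a small number of collective modes rather than with every other token, which is one plausible reason it can stand in for a feed-forward block.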

If this holds at scale, attention as frustrated synchronization would offer a fundamentally different inductive bias from softmax attention. The authors' central claim is that computation lives in the departures from perfect synchronization, not in the alignment of keys and queries. That's a shift worth watching.

Source: Attention as Frustrated Synchronization
Domain: arxiv.org
