
DiScoFormer cuts density error 37x in 100 dimensions, no retraining needed

A single transformer estimates both density and score from a sample, beating KDE by orders of magnitude in high dimensions and adapting to new distributions on the fly.

allenai, discoformer, transformer, density estimation, score matching, kernel density estimation

In 100 dimensions, DiScoFormer cuts density error by more than 37x and score error by 6.5x compared to the best hand-tuned kernel density estimator. KDE runs out of memory; DiScoFormer keeps improving as you add samples. And it does all this without retraining on your specific distribution.

Why Density and Score Matter Together

Every diffusion model, Bayesian sampler, or plasma simulation needs two things from a finite sample: the density (which values are common) and the score (the gradient of the log-density, which points in the direction where density rises fastest). Classical KDE needs no training and applies to any distribution, but its accuracy falls apart in high dimensions. Neural score matching stays accurate in high dimensions but requires retraining for each distribution. DiScoFormer collapses both into one transformer forward pass: give it a set of points, and it returns density and score at any query location.
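To make the two quantities concrete, here is a minimal NumPy sketch of the classical baseline: a Gaussian KDE that returns both the density and the score at arbitrary query points. The function name and the single scalar `bandwidth` are assumptions made for illustration; this is not the tuned KDE used in the comparison above.

```python
import numpy as np

def gaussian_kde_density_and_score(query, samples, bandwidth):
    """Gaussian KDE density and score (gradient of log-density) at query points.

    query:     (m, d) evaluation points
    samples:   (n, d) observed points
    bandwidth: scalar kernel width h (hand-picked; an assumption of this sketch)
    """
    query = np.atleast_2d(query)
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    diff = samples[None, :, :] - query[:, None, :]      # (m, n, d), x_i - x
    sq_dist = np.sum(diff ** 2, axis=-1)                # (m, n)
    log_k = -0.5 * sq_dist / bandwidth ** 2             # unnormalised log-kernel
    # Log-sum-exp trick keeps the sum stable when kernel values are tiny.
    max_log = log_k.max(axis=1, keepdims=True)
    weights = np.exp(log_k - max_log)
    sum_w = weights.sum(axis=1, keepdims=True)
    log_norm = -0.5 * d * np.log(2 * np.pi * bandwidth ** 2) - np.log(n)
    density = np.exp(max_log[:, 0] + np.log(sum_w[:, 0]) + log_norm)
    # Score of a Gaussian KDE: kernel-weighted mean of (x_i - x), divided by h^2.
    normalised = weights / sum_w
    score = np.einsum("mn,mnd->md", normalised, diff) / bandwidth ** 2
    return density, score
```

The pairwise `(m, n, d)` array is also a hint at why this approach gets expensive: memory grows with the product of query count, sample count, and dimension.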

Architecture: Cross-Attention with a Built-In Consistency Check

The model uses stacked transformer blocks with cross-attention, so it evaluates density and score at arbitrary points—not just where you have data. A shared backbone feeds two output heads: one for density, one for score. Because the score is the gradient of the log-density, any mismatch between heads creates a label-free consistency loss. At inference, hold the context fixed, take a few gradient steps on that loss, and the model adapts to an out-of-distribution input on the spot—no ground-truth density or score required.
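The consistency check can be sketched without any labels. The snippet below assumes two hypothetical callables standing in for the model's heads, `log_density_fn` and `score_fn`, and measures the mean squared gap between the predicted score and the gradient of the predicted log-density. The squared-error form and the finite-difference gradient are assumptions of this sketch; a real implementation would presumably use automatic differentiation and the loss defined by the authors.

```python
import numpy as np

def finite_difference_grad(log_density_fn, x, eps=1e-4):
    """Central-difference gradient of a scalar log-density at each row of x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for j in range(x.shape[1]):
        step = np.zeros(x.shape[1])
        step[j] = eps
        grad[:, j] = (log_density_fn(x + step) - log_density_fn(x - step)) / (2 * eps)
    return grad

def consistency_loss(log_density_fn, score_fn, queries, eps=1e-4):
    """Mean squared mismatch between the score head and grad log-density.

    log_density_fn: maps (m, d) points to (m,) log-densities (hypothetical head)
    score_fn:       maps (m, d) points to (m, d) scores (hypothetical head)
    """
    implied_score = finite_difference_grad(log_density_fn, queries, eps)
    predicted_score = score_fn(queries)
    return float(np.mean(np.sum((predicted_score - implied_score) ** 2, axis=1)))
```

For a standard Gaussian, `log_density_fn = lambda x: -0.5 * np.sum(x**2, axis=1)` and `score_fn = lambda x: -x` agree, so the loss is essentially zero; any disagreement between the two heads makes it positive, which is the signal the test-time gradient steps would reduce.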

Allen AI also shows analytically that a single attention head's weights approximate a Gaussian kernel over the data, making KDE a special case. DiScoFormer doesn't throw away classical methods; it includes KDE as the simplest instance and learns multiple scales simultaneously.
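One standard way to see the connection, offered here as an illustration rather than as the authors' derivation: if an attention head's logits are `q·k / h² - ||k||² / (2h²)`, they differ from the Gaussian log-kernel `-||q - k||² / (2h²)` only by a term that depends on the query alone, and softmax ignores per-query constants. The function names and the bandwidth `h` are assumptions of this sketch.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_weights(queries, keys, bandwidth):
    """Softmax attention with dot-product logits plus a per-key bias."""
    key_bias = -0.5 * np.sum(keys ** 2, axis=1)[None, :]   # -||k||^2 / 2
    logits = (queries @ keys.T + key_bias) / bandwidth ** 2
    return softmax(logits, axis=1)

def gaussian_kernel_weights(queries, keys, bandwidth):
    """Normalised Gaussian kernel weights, as used by a KDE-style estimator."""
    sq_dist = np.sum((queries[:, None, :] - keys[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return kernel / kernel.sum(axis=1, keepdims=True)
```

With those logits the two weight matrices coincide up to floating-point error, so a head of this form computes a single-bandwidth kernel average; several heads with different effective bandwidths would correspond to several scales at once.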

Training on Infinite GMMs, Testing on Laplace and Student-t

Every training batch draws a fresh Gaussian Mixture Model—GMMs are universal approximators with closed-form densities and scores, so supervision is exact and unlimited. Despite training only on mixtures of Gaussians, DiScoFormer generalizes to non-Gaussian shapes like Laplace and Student-t, and to mixtures with more modes than it ever saw. KDE's only remaining advantage is speed on tiny datasets.
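The reason GMM supervision is exact can be shown in a few lines: the density is a weighted sum of Gaussians and the score is a responsibility-weighted sum of per-component scores. The sketch below assumes isotropic components and made-up sampling ranges for the random mixture; the actual training distribution over GMMs is not specified here.

```python
import numpy as np

def sample_random_gmm(rng, num_components, dim):
    """Draw random mixture parameters (ranges are assumptions of this sketch)."""
    weights = rng.dirichlet(np.ones(num_components))
    means = rng.normal(scale=3.0, size=(num_components, dim))
    stds = rng.uniform(0.5, 1.5, size=num_components)   # isotropic components
    return weights, means, stds

def gmm_sample(rng, weights, means, stds, n):
    """Draw n points: pick a component per point, then add Gaussian noise."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return means[comp] + stds[comp, None] * rng.normal(size=(n, means.shape[1]))

def gmm_density_and_score(x, weights, means, stds):
    """Closed-form density and score of an isotropic GMM at points x of shape (m, d)."""
    d = means.shape[1]
    diff = means[None, :, :] - x[:, None, :]             # (m, k, d), mu_k - x
    sq_dist = np.sum(diff ** 2, axis=-1)                 # (m, k)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * sq_dist / stds ** 2
                - 0.5 * d * np.log(2 * np.pi * stds ** 2))
    max_log = log_comp.max(axis=1, keepdims=True)        # log-sum-exp for stability
    unnorm = np.exp(log_comp - max_log)
    total = unnorm.sum(axis=1, keepdims=True)
    density = np.exp(max_log[:, 0]) * total[:, 0]
    resp = unnorm / total                                # component responsibilities
    # Score: sum_k resp_k * (mu_k - x) / sigma_k^2
    score = np.einsum("mk,mkd->md", resp / stds[None, :] ** 2, diff)
    return density, score
```

A training batch in this style would call `sample_random_gmm`, draw context points with `gmm_sample`, and use `gmm_density_and_score` at the query points as exact targets for both heads.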

What excites me most is the reuse angle. Score estimation is the shared dependency across generative modeling, Bayesian inference, and scientific computing. A single pretrained estimator that stays accurate in high dimensions and adapts without retraining could cut cost across all those fields at once—one model, everywhere score and density show up.


Source: DiScoFormer: One transformer for density and score, across distributions
Domain: huggingface.co
