
DiffusionGemma's Opaque Depth Drops from 28.6X to 1.1X with a Token Bottleneck

alignmentforum.org · @rapid_panther · 3 hours ago · Artificial Intelligence · 4 comments

DiffusionGemma looks 28.6 times more opaque than Gemma, but an interpretable token bottleneck closes that gap. Algorithmic transparency remains a challenge.

google deepmind · diffusiongemma · interpretability · transparency · ai safety · latent reasoning

DiffusionGemma's opaque serial depth starts at 28.6X that of the equivalent autoregressive Gemma model, meaning 28.6 times more computation happens between interpretable model states. That number sounds catastrophic for anyone hoping to understand what the model is doing. But Google DeepMind's interpretability team found a way to collapse it to just 1.1X without sacrificing performance.

Variable Transparency: The Token Bottleneck Trick

The naive measurement assumes all intermediate self-conditioning vectors are black boxes. The team showed you can replace those vectors with their top-k or top-p tokens - essentially mapping the continuous latent information back into discrete tokens - and downstream benchmarks barely budge. Those top tokens mostly match or are semantically similar to nearby tokens in the final canvas. That means the intermediate states are interpretable, even if we don't yet know exactly how the model uses them.
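As a rough illustration of what a token bottleneck does, here is a minimal sketch in plain Python. It is not DiffusionGemma's code. It assumes the self-conditioning signal at one canvas position can be written as a probability distribution over the vocabulary, and that the bottlenecked vector is rebuilt as a weighted average of token embeddings. The function names and the tiny embedding table are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_bottleneck(probs, k):
    """Keep the k most probable token ids and renormalise their mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def top_p_bottleneck(probs, p):
    """Keep the smallest high-probability set whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

def re_embed(token_weights, embedding_table):
    """Assumed reconstruction: weighted average of the kept tokens' embeddings."""
    dim = len(embedding_table[0])
    out = [0.0] * dim
    for tok, w in token_weights.items():
        for d in range(dim):
            out[d] += w * embedding_table[tok][d]
    return out

# Hypothetical 4-token vocabulary with 2-dimensional embeddings.
logits = [2.0, 1.0, 0.0, -1.0]
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

probs = softmax(logits)
kept = top_k_bottleneck(probs, k=2)   # a short, human-readable token list
vector = re_embed(kept, table)        # what would be passed on in place of the raw vector
```

The point of the sketch is that `kept` is a short list of tokens a person can read, while `vector` is all the next step gets to see. Whatever information fell outside the kept tokens is discarded.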

Algorithmic Transparency: Harder Than It Looks

Variable transparency is only half the story. Algorithmic transparency asks whether we can reconstruct the model's reasoning process from those interpretable snapshots. Autoregressive models give you a clear chronological trace: token by token, you see the exact state at each step. DiffusionGemma generates all tokens on a single canvas at once, and every token can change at every denoising step. The model can use tokens at the end of the canvas to help decide what to put at the beginning - non-chronological reasoning. It can also "smear" probability distributions across adjacent positions when it's confident a token exists but unsure exactly where it goes.
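One simple way to see why a canvas has no chronological trace is to diff successive canvas snapshots and record when each position last changed. The sketch below assumes a fixed-length canvas captured as a list of token lists, one per denoising step. The helper names and the toy tokens are hypothetical and are not taken from the paper.

```python
def changed_positions(prev, curr):
    """Indices whose token differs between two snapshots of equal length."""
    return [i for i, (a, b) in enumerate(zip(prev, curr)) if a != b]

def revision_trace(snapshots):
    """For each denoising step after the first, list the positions that changed."""
    return [
        (step, changed_positions(snapshots[step - 1], snapshots[step]))
        for step in range(1, len(snapshots))
    ]

def last_change_step(snapshots):
    """Step at which each position last changed (0 if it never changed)."""
    last = [0] * len(snapshots[0])
    for step, positions in revision_trace(snapshots):
        for i in positions:
            last[i] = step
    return last

def out_of_order_pairs(snapshots):
    """Pairs (i, j) with i < j where the earlier position settled later."""
    last = last_change_step(snapshots)
    n = len(last)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if last[i] > last[j]]

# Toy canvas with placeholder tokens; not real model output.
snapshots = [
    ["A", "?", "?", "?"],
    ["A", "x", "y", "?"],
    ["B", "x", "y", "z"],  # position 0 is rewritten after positions 1 and 2 settled
]

trace = revision_trace(snapshots)
pairs = out_of_order_pairs(snapshots)
```

For an autoregressive model `out_of_order_pairs` would always be empty, since each position is written once, left to right. A non-empty result marks the places where later content was fixed before earlier content.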

Case Studies and Open Problems

The paper documents specific phenomena like retroactive self-correction: when asked to count perfect squares between 400 and 800, the model initially outputs a wrong answer, lists the squares, then in later denoising steps corrects its earlier output. That's the kind of behavior that makes algorithmic transparency for diffusion models fundamentally different from autoregressive ones. The team includes 24 open problems for the community, focusing on techniques like Natural Language Autoencoders and Activation Oracles that can translate latent activations into natural text. If future latent reasoning architectures regress on monitorability metrics, we'll need those tools ready.


Source: [Linkpost] How Transparent Is DiffusionGemma (and why it matters)
Domain: alignmentforum.org
