DiffusionGemma's Opaque Depth Drops from 28.6X to 1.1X with a Token Bottleneck

alignmentforum.org · @rapid_panther · 3 hours ago · Artificial Intelligence · 4 comments

DiffusionGemma appears 28.6 times more opaque than Gemma, but an interpretable token bottleneck closes that gap, although algorithmic transparency remains a challenge.

google deepmind · diffusiongemma · interpretability · transparency · ai safety · latent reasoning

DiffusionGemma's opaque serial depth starts at 28.6X that of the equivalent autoregressive Gemma model, meaning 28.6 times more computation happens between interpretable model states. That number sounds catastrophic for anyone hoping to understand what the model is doing. But Google DeepMind's interpretability team found a way to collapse it to just 1.1X without sacrificing performance.
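One way to read a figure like this is as the longest run of serial computation between two interpretable states. The Python sketch below is only an illustration of that reading, not the paper's definition: the function name `opaque_serial_depth`, the stage representation, and the toy layer and step counts are all assumptions, and the toy ratios are not meant to reproduce 28.6X or 1.1X.

```python
def opaque_serial_depth(stages):
    """Longest serial depth between interpretable states.

    `stages` is a list of (depth, output_is_interpretable) pairs in
    execution order. This representation is a hypothetical simplification.
    """
    longest = 0
    run = 0
    for depth, interpretable in stages:
        run += depth
        if interpretable:
            # An interpretable state ends the current opaque run.
            longest = max(longest, run)
            run = 0
    # Count a trailing run that never reached an interpretable state.
    return max(longest, run)


# Toy numbers, invented for illustration only.
LAYERS = 10  # layers per forward pass
STEPS = 4    # denoising steps

# Autoregressive: every forward pass ends in a readable token.
autoregressive = [(LAYERS, True)] * STEPS
# Diffusion, intermediate states treated as black boxes.
diffusion_opaque = [(LAYERS, False)] * (STEPS - 1) + [(LAYERS, True)]
# Diffusion, every intermediate state readable as tokens.
diffusion_readable = [(LAYERS, True)] * STEPS

baseline = opaque_serial_depth(autoregressive)
ratio_opaque = opaque_serial_depth(diffusion_opaque) / baseline
ratio_readable = opaque_serial_depth(diffusion_readable) / baseline
```

In this toy setup the ratio scales with the number of steps whose state is treated as unreadable, which is why making each intermediate state interpretable has such a large effect on the headline number.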

Variable Transparency: The Token Bottleneck Trick

The naive measurement assumes all intermediate self-conditioning vectors are black boxes. The team showed you can replace those vectors with their top-k or top-p tokens - essentially mapping the continuous latent information back into discrete tokens - and downstream benchmarks barely budge. Those top tokens mostly match or are semantically similar to nearby tokens in the final canvas. That means the intermediate states are interpretable, even if we don't yet know exactly how the model uses them.
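The replacement described above can be sketched in a few lines of Python. This is a minimal illustration under a strong assumption: that a self-conditioning vector can be viewed as a probability-weighted mix of token embeddings. The article does not specify the actual mechanism, and the helper names (`top_k_bottleneck`, `top_p_bottleneck`, `mix_embeddings`) and the tiny vocabulary are hypothetical.

```python
def top_k_bottleneck(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    mass = sum(probs[i] for i in keep)
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / mass
    return out


def top_p_bottleneck(probs, p):
    """Keep the smallest set of top tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = []
    mass = 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / mass
    return out


def mix_embeddings(probs, embedding_table):
    """Probability-weighted average of token embeddings (assumed form)."""
    dim = len(embedding_table[0])
    return [
        sum(probs[t] * embedding_table[t][d] for t in range(len(probs)))
        for d in range(dim)
    ]


# Toy 4-token vocabulary with 2-dimensional embeddings (invented values).
probs = [0.5, 0.3, 0.15, 0.05]
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

full_vector = mix_embeddings(probs, table)
bottlenecked = mix_embeddings(top_k_bottleneck(probs, 2), table)
```

The bottlenecked vector depends only on a short, human-readable list of tokens and their weights, which is what makes the intermediate state inspectable in this sketch.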

Algorithmic Transparency: Harder Than It Looks

Variable transparency is only half the story. Algorithmic transparency asks whether we can reconstruct the model's reasoning process from those interpretable snapshots. Autoregressive models give you a clear chronological trace: token by token, you see the exact state at each step. DiffusionGemma generates all tokens on a single canvas at once, and every token can change at every denoising step. The model can use tokens at the end of the canvas to help decide what to put at the beginning - non-chronological reasoning. It can also "smear" probability distributions across adjacent positions when it's confident a token exists but unsure exactly where it goes.

Case Studies and Open Problems

The paper documents specific phenomena like retroactive self-correction: when asked to count perfect squares between 400 and 800, the model initially outputs a wrong answer, lists the squares, then in later denoising steps corrects its earlier output. That's the kind of behavior that makes algorithmic transparency for diffusion models fundamentally different from autoregressive ones. The team includes 24 open problems for the community, focusing on techniques like Natural Language Autoencoders and Activation Oracles that can translate latent activations into natural text. If future latent reasoning architectures regress on monitorability metrics, we'll need those tools ready.
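Retroactive self-correction of this kind can be spotted mechanically once the canvas is readable at every step. The Python sketch below assumes per-step canvas snapshots are available as lists of tokens; the function names and the toy trace are invented for illustration and are not taken from the paper.

```python
def last_change_step(snapshots):
    """For each canvas position, the last step at which its token changed.

    `snapshots[s][pos]` is the token at position `pos` after step `s`.
    A position that never changes gets 0.
    """
    width = len(snapshots[0])
    last = [0] * width
    for step in range(1, len(snapshots)):
        for pos in range(width):
            if snapshots[step][pos] != snapshots[step - 1][pos]:
                last[pos] = step
    return last


def retroactive_revisions(snapshots):
    """Positions revised after some later position had already settled."""
    last = last_change_step(snapshots)
    width = len(last)
    return [
        pos
        for pos in range(width)
        if any(last[pos] > last[later] for later in range(pos + 1, width))
    ]


# Invented toy trace: position 0 holds an "answer" token that is rewritten
# at the final step, after the tokens to its right have stopped changing.
M = "<mask>"
trace = [
    [M, M, M, M],
    ["2", "a", M, M],
    ["2", "a", "b", "c"],
    ["3", "a", "b", "c"],
]

settled_at = last_change_step(trace)
revised = retroactive_revisions(trace)
```

An autoregressive trace would never flag a position here, since earlier tokens are fixed before later ones exist; any flagged position is a sign of the non-chronological behavior described above.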


Source: [Linkpost] How Transparent Is DiffusionGemma (and why it matters)
Domain: alignmentforum.org
