Manifestation Units: A Protocol for Composable, Queryable Neural Network Interpretability

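A new tuple protocol for organizing component-level analysis of neural networks outperforms unstructured baselines, recovers known IOI circuit members, and reveals a minimal two-field core for interpretation.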

mechanistic interpretability, manifestation units, transformer architectures, representation bottleneck, gpt 2, cnn

The biggest problem with mechanistic interpretability today isn't that we can't figure out what a neuron does — it's that once we do, that knowledge sits in a Jupyter notebook, useless for any other model or task. A new paper proposes a protocol called Manifestation Units that turns that ad-hoc output into structured, queryable data, and the results suggest we've been missing a systematic way to reuse what we learn.

A Protocol That Makes Interpretability Outputs Composable

The authors define a typed tuple (E, S, R, D, G) — extended with attention-head primitives (T) for transformers — that organizes per-component statistics into structured fields. These fields are populated automatically and queried through hybrid retrieval. They tested the protocol across three architectures: a beta-VAE for generative vision, a CNN for discriminative vision, and GPT-2 for language. The idea isn't to scale to frontier models but to provide schema infrastructure that makes component-level analyses actually reusable.
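As a rough illustration of what a typed, queryable record could look like, the Python sketch below models one unit as a dataclass and runs a toy hybrid query: an exact filter on R plus a token-overlap score on S. Everything in it is an assumption made for illustration: the class name `ManifestationUnit`, the `component` identifier, the `query` function, the sample records, and the choice of Jaccard overlap as the text score. This summary does not describe the paper's actual field contents or retrieval method, so E, D, G and T are left as opaque placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ManifestationUnit:
    """Hypothetical record for one network component (illustrative only)."""
    component: str                          # made-up identifier, e.g. "toy.head.0"
    S: str                                  # free text, scored lexically below
    R: str                                  # categorical label, filtered exactly below
    E: dict = field(default_factory=dict)   # placeholder; contents not specified here
    D: dict = field(default_factory=dict)   # placeholder
    G: dict = field(default_factory=dict)   # placeholder
    T: Optional[dict] = None                # attention-head primitives, if any


def query(units, text, role=None, k=3):
    """Toy hybrid retrieval: exact filter on R, then Jaccard overlap on S."""
    q_tokens = set(text.lower().split())
    scored = []
    for u in units:
        if role is not None and u.R != role:
            continue  # structured part: drop units whose typed field mismatches
        s_tokens = set(u.S.lower().split())
        union = q_tokens | s_tokens
        score = len(q_tokens & s_tokens) / len(union) if union else 0.0
        scored.append((score, u.component))
    # Highest score first; ties broken by component name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Invented sample records, not taken from the paper.
UNITS = [
    ManifestationUnit("toy.head.0", S="moves subject name to final position", R="mover"),
    ManifestationUnit("toy.head.1", S="suppresses duplicated name tokens", R="inhibitor"),
    ManifestationUnit("toy.filter.7", S="responds to vertical edges", R="detector"),
]

print(query(UNITS, "name tokens"))
print(query(UNITS, "name tokens", role="mover"))
```

Filtering on a typed field before scoring the text is the part that plain text search over notebook notes cannot do, which is the intuition behind comparing a typed schema against an unstructured baseline.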

Typed Structure Beats Unstructured Retrieval by a Clear Margin

On retrieval tasks, the typed Manifestation Unit schema substantially outperformed unstructured baselines. More importantly, the team ran causal sufficiency and necessity tests on CNN filters retrieved by the schema under matched-budget controls, and the filters passed both criteria. The schema also absorbed attention-head primitives without modification and recovered, as a set, known members of the IOI (indirect object identification) circuit under retrieval-budget-matched controls. That's not hand-wavy alignment; it's a concrete check that the retrieved components are causally relevant.
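To make the necessity, sufficiency, and matched-budget ideas concrete, here is a minimal sketch on a toy additive model. It is not the paper's procedure: the function names, the contribution values, and the additive model itself are assumptions chosen so the arithmetic is easy to follow. Necessity is measured by ablating the retrieved set, sufficiency by keeping only that set, and the control draws random sets of the same size.

```python
import random


def output(contrib, keep):
    """Toy model: the output is the sum of contributions from kept components."""
    return sum(contrib[i] for i in keep)


def necessity(contrib, subset):
    """Fraction of the full output lost when `subset` is ablated."""
    everything = range(len(contrib))
    full = output(contrib, everything)
    ablated = output(contrib, [i for i in everything if i not in subset])
    return (full - ablated) / full


def sufficiency(contrib, subset):
    """Fraction of the full output retained when only `subset` is kept."""
    full = output(contrib, range(len(contrib)))
    return output(contrib, subset) / full


def matched_budget_baseline(contrib, k, metric, trials=200, seed=0):
    """Mean metric over random subsets of size k (same budget as the retrieved set)."""
    rng = random.Random(seed)
    indices = list(range(len(contrib)))
    total = sum(metric(contrib, set(rng.sample(indices, k))) for _ in range(trials))
    return total / trials


# Invented per-filter contributions; filters 0 and 1 dominate by construction.
CONTRIB = [50, 40, 2, 2, 2, 2, 2]
RETRIEVED = {0, 1}  # pretend these are the filters returned by a schema query

nec = necessity(CONTRIB, RETRIEVED)
suf = sufficiency(CONTRIB, RETRIEVED)
base = matched_budget_baseline(CONTRIB, len(RETRIEVED), necessity)
print(nec, suf, base)
```

In a purely additive toy like this one the necessity and sufficiency scores coincide; in a real network with nonlinearities and redundant pathways they generally differ, which is why both are tested. The comparison that matters is the retrieved set against the random same-size baseline.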

The Irreducible Core: Only S and R Matter

Perhaps the sharpest finding: after systematic ablation, the protocol reveals an irreducible two-field core — S (semantics) and R (role) — with the remaining fields either redundant or actively interfering. D (direction) and G (gain) don't add signal; in some configurations they hurt retrieval performance. This is the kind of empirical pruning that turns a research artifact into an engineering primitive.
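A field ablation of this kind can be sketched as a small harness that scores every subset of fields and reports the smallest subset that matches the best score. The sketch below assumes a caller-supplied `score` function that maps a set of field names to a retrieval metric. The `toy_score` stand-in is entirely made up: its weights simply hard-code the pattern described above so the harness has something to run on, and they are not measurements from the paper.

```python
from itertools import combinations

FIELDS = ["E", "S", "R", "D", "G"]

# Invented weights for illustration only: S and R help, E is neutral, D and G hurt.
TOY_WEIGHTS = {"E": 0, "S": 2, "R": 2, "D": -1, "G": -1}


def toy_score(active):
    """Stand-in for a real retrieval metric evaluated with only `active` fields."""
    return sum(TOY_WEIGHTS[f] for f in active)


def leave_one_out(fields, score):
    """Score change from removing each field; positive means the field was helping."""
    full = score(frozenset(fields))
    return {f: full - score(frozenset(fields) - {f}) for f in fields}


def minimal_core(fields, score, tol=0):
    """Smallest field subset scoring within `tol` of the best subset."""
    subsets = [
        frozenset(combo)
        for size in range(1, len(fields) + 1)
        for combo in combinations(fields, size)
    ]
    best = max(score(s) for s in subsets)
    # Visit small subsets first; sort names within a size for determinism.
    for s in sorted(subsets, key=lambda s: (len(s), sorted(s))):
        if score(s) >= best - tol:
            return s


deltas = leave_one_out(FIELDS, toy_score)
core = minimal_core(FIELDS, toy_score)
print(deltas)
print(sorted(core))
```

Exhaustive subset search is only practical because the schema has a handful of fields; five fields give 31 non-empty subsets. A negative leave-one-out delta is the signature of a field that interferes rather than merely adding nothing.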

What this enables next: a shared schema for mechanistic interpretability that lets practitioners query across models and tasks without re-reading each other's notebooks. The Manifestation Unit protocol won't replace the need for frontier-scale validation, but it gives the field a way to stop starting from scratch every time.


Source: Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol
Domain: arxiv.org
