
ARC's MSP Algorithm Estimates Catastrophic AI Risk Without Catastrophic Samples

alignmentforum.org · @frontier_wire · 3 hours ago · Artificial Intelligence · 2 comments

ARC's Matching Sampling Principle offers a mechanistic estimator for neural networks that can estimate rare behavior more efficiently than Monte Carlo sampling, potentially removing the need to wait for a failure to show up in samples.

arc · matching sampling principle · mechanistic interpretability · ai safety · large language models · alignment research

ARC's Matching Sampling Principle (MSP) lets you estimate a neural network's catastrophic failure probability without ever needing a single sample where the model actually fails. That's the core bet: infer rare dangers from the learned algorithm itself, not from its outputs.

How MSP Beats Black-Box Sampling

The MSP states that for any architecture and precision, there exists a mechanistic estimator that at least matches the performance of sampling over random instances. ARC's crown jewel is an algorithm that takes the weights of a multilayer perceptron and estimates its expected loss to within additive error ε, running faster than a Monte Carlo estimator reaching the same ε. The trick is cumulant propagation: derive the distribution of activations layer by layer, using heuristic assumptions such as treating unknown correlations as zero. When those assumptions are wrong, ARC argues there must be a structural reason (the No-Coincidence Principle), and that structure can be encoded as advice to improve the estimator.
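To make the layer-by-layer idea concrete, here is a minimal sketch that propagates only the first two cumulants (mean and variance) through a small ReLU network. It is not ARC's algorithm. It assumes independent Gaussian inputs, treats every pre-activation as Gaussian, and sets all unknown correlations to zero. The network weights and all function names are hypothetical, chosen for illustration.

```python
import math
import random

def gaussian_relu_moments(mu, var):
    """Mean and variance of max(0, X) for X ~ N(mu, var)."""
    if var <= 0.0:
        return max(0.0, mu), 0.0
    sigma = math.sqrt(var)
    z = mu / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    mean = mu * cdf + sigma * pdf
    second = (mu * mu + var) * cdf + mu * sigma * pdf
    return mean, max(0.0, second - mean * mean)

def linear_moments(weights, biases, means, variances):
    """Moments of W x + b, ASSUMING the coordinates of x are uncorrelated."""
    out_means, out_vars = [], []
    for row, b in zip(weights, biases):
        out_means.append(b + sum(w * m for w, m in zip(row, means)))
        out_vars.append(sum(w * w * v for w, v in zip(row, variances)))
    return out_means, out_vars

def propagate(layers, in_means, in_vars):
    """Push (mean, variance) through linear layers with ReLU in between."""
    means, variances = in_means, in_vars
    for i, (weights, biases) in enumerate(layers):
        means, variances = linear_moments(weights, biases, means, variances)
        if i < len(layers) - 1:  # no ReLU after the final layer
            pairs = [gaussian_relu_moments(m, v) for m, v in zip(means, variances)]
            means = [p[0] for p in pairs]
            variances = [p[1] for p in pairs]
    return means, variances

def forward(layers, x):
    """Ordinary forward pass, used only for the Monte Carlo baseline."""
    for i, (weights, biases) in enumerate(layers):
        x = [b + sum(w * xi for w, xi in zip(row, x)) for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def monte_carlo_mean(layers, n_inputs, n_samples, rng):
    """Black-box baseline: average the first output over standard normal inputs."""
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
        total += forward(layers, x)[0]
    return total / n_samples

# Hypothetical 2-3-1 network, for illustration only.
EXAMPLE_LAYERS = [
    ([[1.0, -0.5], [0.3, 0.8], [-0.7, 0.2]], [0.1, -0.2, 0.0]),
    ([[0.5, -1.0, 0.25]], [0.05]),
]

if __name__ == "__main__":
    est_means, est_vars = propagate(EXAMPLE_LAYERS, [0.0, 0.0], [1.0, 1.0])
    mc = monte_carlo_mean(EXAMPLE_LAYERS, 2, 20000, random.Random(0))
    print("propagated mean:", est_means[0], "monte carlo mean:", mc)
```

With a single hidden layer and independent Gaussian inputs, the propagated mean is exact, because each pre-activation really is Gaussian and the output layer is linear. The propagated variance is not, since hidden units that share inputs are correlated. That gap is the kind of ignored structure the paragraph above says could be supplied as advice.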

The Mechanistic Decomposition Strategy

ARC defines "mechanistic" as never assuming that unseen behavior resembles seen behavior, whether for the full model, a single neuron, or any object derived from the model. Decomposing the model into simple pieces isn't enough; those pieces must also be unable to conspire. ARC's algorithms therefore express estimates as sums of terms that are subjectively uncorrelated. The Liouville function example shows how: if you assume the function has mean zero and that its values at different integers are uncorrelated, the estimator after evaluating N points is (1/N) Σ λ(i), but with a factor that differs from naive sampling. That estimator might fail if the independence assumption is false (for example, proving it would require the Riemann Hypothesis), but ARC says the failure itself reveals structure that can be fed back into the estimator.
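For readers unfamiliar with the Liouville function, the sketch below computes λ(n) = (-1)^Ω(n), where Ω(n) counts prime factors with multiplicity, and averages it over the first N integers. It does not reproduce the factor mentioned above, which the summary leaves unspecified. It only compares the running mean with the 1/√N scale that uncorrelated ±1 values would produce. The function names are illustrative.

```python
import math

def liouville(n):
    """Liouville function: +1 if n has an even number of prime factors
    (counted with multiplicity), -1 otherwise."""
    count = 0
    p = 2
    while p * p <= n:
        while n % p == 0:
            n //= p
            count += 1
        p += 1
    if n > 1:  # whatever remains is a single prime factor
        count += 1
    return -1 if count % 2 else 1

def liouville_mean(n_points):
    """(1/N) * sum of lambda(i) for i = 1..N."""
    return sum(liouville(i) for i in range(1, n_points + 1)) / n_points

if __name__ == "__main__":
    for n_points in (10, 100, 1000, 10000):
        mean = liouville_mean(n_points)
        scale = 1.0 / math.sqrt(n_points)  # std of the mean of N uncorrelated +/-1 values
        print(n_points, mean, scale)
```

If the values behaved very differently from uncorrelated signs, the running mean would stay far outside that scale. Such a discrepancy is what the paragraph above describes as structure to feed back into the estimator.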

Dealing with Real-World Distributions

The MSP is typically formulated for easy-to-sample distributions, but the real target is something like "plausible inputs for GPT-6." ARC's solution is to treat the data-generating process as a black box, on the grounds that nature isn't trying to fool you. Use training samples to infer the distribution's parameters (e.g., the mean of a Gaussian), then mechanistically compute expectations over the inferred distribution. The expected squared error scales as 1/N, the same rate as naive sampling but with a better constant. ARC admits this part is underdeveloped but says it's now a series of bounded technical questions rather than a vague unknown.
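A toy version of the infer-then-compute recipe, under assumptions that are not from the source: inputs are drawn from N(μ, 1) with known unit variance, and the quantity of interest is E[x²], which equals μ² + 1 in closed form. The plug-in estimator fits μ from samples and evaluates the closed form. The naive estimator averages x² over the same samples. All names here are hypothetical.

```python
import random

def naive_estimate(samples):
    """Black-box estimate: average f(x) = x^2 over the samples."""
    return sum(x * x for x in samples) / len(samples)

def plugin_estimate(samples):
    """Infer the Gaussian mean, then compute E[x^2] = mu^2 + 1 analytically.
    ASSUMES the variance is known to be 1."""
    mu_hat = sum(samples) / len(samples)
    return mu_hat * mu_hat + 1.0

def mean_squared_errors(mu, n, trials, rng):
    """Empirical MSE of both estimators over repeated draws of n samples."""
    truth = mu * mu + 1.0
    naive_se = 0.0
    plugin_se = 0.0
    for _ in range(trials):
        samples = [rng.gauss(mu, 1.0) for _ in range(n)]
        naive_se += (naive_estimate(samples) - truth) ** 2
        plugin_se += (plugin_estimate(samples) - truth) ** 2
    return naive_se / trials, plugin_se / trials

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (25, 50, 100):
        naive_mse, plugin_mse = mean_squared_errors(1.0, n, 2000, rng)
        print(n, naive_mse, plugin_mse)
```

In this toy setting the naive estimator has mean squared error (4μ² + 2)/N, while the plug-in estimator has 4μ²/N + 3/N². Both decay as 1/N when μ is nonzero, and the plug-in version has the smaller constant.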

The Critical Open Problem: Aligned to What?

ARC hasn't invested deeply in defining the goodness function that the alignment pipeline targets. The current best guess involves corrigibility and indirect normativity: train the AI to defer to the user's future self, not to manipulate the user's judgment. That's a real technical challenge—distinguishing "make the world good" from "make the user think the world is good." ARC's tools could theoretically train either, but nobody has put in the legwork. Mike calls this the most neglected part of the agenda.

ARC's next hurdle is turning these half-baked philosophical insights into concrete toy examples on simple models of computation. Failure to produce one within a year would be a huge red flag.


Source: A Mike's-Eye View of ARC's Research
Domain: alignmentforum.org
