
ARC Launches $100,000 Challenge to Solve White-Box MLPs

alignmentforum.org · @frontier_wire · 1 hour ago · Artificial Intelligence · 2 comments

A new contest aims to improve white-box estimation for random MLPs, moving beyond black-box sampling to better understand highly intelligent AI systems.

Tags: arc, aicrowd, mlp, mechanistic interpretability, artificial intelligence

White-box approaches offer a way to answer questions about highly intelligent AI systems, such as "Are there unusual situations in which the system would undermine human control?" Running the system on a sample of inputs is not a reliable way to answer such questions, because a highly intelligent system might not fall for our "honey pots".

Why White-Box Approaches Beat Black-Box Sampling

Black-box testing relies on sampling inputs to see how a model reacts. For highly intelligent systems, this is fundamentally unreliable; a sufficiently smart agent might avoid the specific "honey pots" or edge cases we've designed to test its alignment.
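To make the black-box baseline concrete, here is a minimal sketch of sampling-based estimation for a random MLP. Everything specific in it is an assumption for illustration and does not come from the contest rules: the ReLU activation, the He-style weight scaling, the standard Gaussian input distribution, the choice of "mean over output coordinates" as the quantity being estimated, and the hypothetical function names.

```python
import math
import random

def relu(x):
    return x if x > 0.0 else 0.0

def random_mlp(width, depth, rng):
    """Return `depth` square weight matrices with N(0, 2/width) entries (assumed scaling)."""
    std = math.sqrt(2.0 / width)
    return [[[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
            for _ in range(depth)]

def forward(weights, x):
    """Apply ReLU(W @ h) layer by layer and return the final activation vector."""
    h = x
    for W in weights:
        h = [relu(sum(w * v for w, v in zip(row, h))) for row in W]
    return h

def sample_estimate(weights, n_samples, rng):
    """Monte Carlo estimate of E[mean(output)] for x ~ N(0, I), with its standard error."""
    width = len(weights[0][0])
    outs = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        y = forward(weights, x)
        outs.append(sum(y) / len(y))
    mean = sum(outs) / n_samples
    var = sum((o - mean) ** 2 for o in outs) / (n_samples - 1)
    return mean, math.sqrt(var / n_samples)

if __name__ == "__main__":
    rng = random.Random(0)
    weights = random_mlp(width=16, depth=3, rng=rng)
    print(sample_estimate(weights, n_samples=500, rng=rng))
```

The standard error shrinks only like one over the square root of the number of samples, and the estimate says nothing about inputs that were never drawn, which is the limitation the paragraph above describes.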

To truly understand whether a system might undermine human control, we need white-box approaches that leverage direct access to the model's internals. ARC's strategy is to build up to this challenge by first mastering white-box estimation for randomly initialized networks, and then adapting those methods to networks as they change over the course of training.

Scaling Mechanistic Estimation Beyond Large Widths

Current white-box methods outperform black-box methods for MLPs with large widths, but that advantage has not yet been established outside the large-width setting. This contest aims to bridge that gap.

Contestants must design algorithms that take in a set of weights and produce an estimate for the expected output. The goal is to minimize mean squared error (MSE) under specific computational constraints. To prevent participants from gaining an advantage through heavily optimized numerical kernels, ARC and AIcrowd have implemented a FLOP-counting scheme. This forces a focus on higher-level algorithmic design rather than low-level optimization.
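The scoring idea can be sketched in a few lines. The article does not describe the contest's actual FLOP accounting, so the convention below (two FLOPs per weight in a dense layer: one multiply and one add) and all function names are assumptions made purely for illustration.

```python
def mse(estimates, references):
    """Mean squared error between per-network estimates and reference values."""
    return sum((e - r) ** 2 for e, r in zip(estimates, references)) / len(estimates)

def dense_layer_flops(n_in, n_out):
    """Assumed convention: one multiply and one add per weight."""
    return 2 * n_in * n_out

def sampling_flops(width, depth, n_samples):
    """FLOPs spent pushing `n_samples` inputs through `depth` square layers."""
    return n_samples * depth * dense_layer_flops(width, width)

def max_samples_within_budget(width, depth, flop_budget):
    """How many forward passes a black-box sampler can afford under a FLOP budget."""
    return flop_budget // (depth * dense_layer_flops(width, width))

if __name__ == "__main__":
    print(max_samples_within_budget(width=256, depth=4, flop_budget=10**9))
```

Under a counting rule like this, a faster matrix kernel buys nothing, because the budget is charged per arithmetic operation rather than per second. The only way to lower MSE at a fixed budget is to extract more information per FLOP.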

While "white-box" is in the challenge name, contestants can use any method, including black-box sampling. However, the strongest results are expected to be "mechanistic," mirroring existing research in the large-width setting.
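As a rough illustration of what a "mechanistic" estimate can look like, the sketch below reads the weights directly and propagates means and variances through the layers instead of sampling inputs. It is not ARC's method. It is a simple moment-propagation approximation that assumes ReLU activations, standard Gaussian inputs, Gaussian pre-activations at every layer, and no correlation between neurons within a layer. Those assumptions are most plausible at large width, and the function names are hypothetical.

```python
import math

def gaussian_relu_moments(mu, var):
    """Mean and variance of max(0, z) for z ~ N(mu, var), using closed-form Gaussian integrals."""
    if var <= 0.0:
        return max(0.0, mu), 0.0
    sigma = math.sqrt(var)
    t = mu / sigma
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    mean = mu * cdf + sigma * pdf
    second = (mu * mu + var) * cdf + mu * sigma * pdf
    return mean, max(0.0, second - mean * mean)

def moment_propagation_estimate(weights):
    """Estimate E[mean(output)] for x ~ N(0, I) from the weights alone, without sampling."""
    n_in = len(weights[0][0])
    means = [0.0] * n_in
    variances = [1.0] * n_in
    for W in weights:
        new_means, new_vars = [], []
        for row in W:
            # Pre-activation moments; cross-neuron covariances are ignored (assumption).
            mu = sum(w * m for w, m in zip(row, means))
            var = sum(w * w * v for w, v in zip(row, variances))
            m, v = gaussian_relu_moments(mu, var)
            new_means.append(m)
            new_vars.append(v)
        means, variances = new_means, new_vars
    return sum(means) / len(means)

if __name__ == "__main__":
    print(moment_propagation_estimate([[[3.0, 4.0]]]))
```

For a single layer the result is exact, since the pre-activations of a linear map applied to Gaussian inputs are themselves Gaussian. For deeper networks the ignored correlations introduce error, and the cost is about one pass over the weights regardless of the accuracy achieved.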

The challenge is intended to spur the development of more robust tools for mechanistic interpretability, moving us closer to answering fundamental questions about AI safety and control.


Source: Announcing the ARC White-Box Estimation Challenge
Domain: alignmentforum.org
