New Concentration Bound Cancels Latent Variance in AI Benchmarks Like MMLU

A de Finetti-inspired analysis shows that for zero-sum linear contrasts, the latent mixture term - the main source of uncertainty in exchangeable data - cancels exactly, yielding a tighter Hoeffding-type bound for...

arxiv, mmlu, concentration inequalities, de finetti, ai benchmark uncertainty, exchangeable sequences

The latent variance you've been ignoring when you subsample a benchmark like MMLU doesn't just shrink - it disappears exactly for the estimator you actually use. That's the punchline of a new arXiv paper that applies de Finetti's theorem to concentration inequalities for infinitely exchangeable sequences.

The Latent Mixture Cancellation

Standard concentration bounds for exchangeable data (like questions sampled from domains in MMLU) have two terms: one from sampling noise, and one from the latent mixture - the variance due to the unknown distribution of question types. The authors prove that for any function with bounded-difference constants $c_1, \dots, c_n$, the deviation decomposes cleanly when conditioned on the de Finetti directing measure. The effective variance proxy becomes $\frac{1}{4}\sum_i c_i^2 + \sigma_{\mathrm{mix}}^2$, where $\sigma_{\mathrm{mix}}^2$ is the subgaussian parameter of the latent mixture.
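To make the two-term proxy concrete, here is a minimal Python sketch. It assumes the usual subgaussian tail $2\exp(-t^2/(2v))$ for a variance proxy $v$, which reduces to the familiar bounded-difference bound $2\exp(-2t^2/\sum_i c_i^2)$ when $\sigma_{\mathrm{mix}}^2 = 0$. The function names are illustrative, and the paper's exact statement may differ in constants.

```python
import math


def variance_proxy(c, sigma_mix_sq=0.0):
    """Variance proxy (1/4) * sum(c_i^2) + sigma_mix^2 from the text."""
    return 0.25 * sum(ci * ci for ci in c) + sigma_mix_sq


def two_sided_tail(t, v):
    """Assumed subgaussian tail bound on P(|f - E f| >= t) for proxy v."""
    if v <= 0:
        raise ValueError("variance proxy must be positive")
    return min(1.0, 2.0 * math.exp(-t * t / (2.0 * v)))


if __name__ == "__main__":
    n = 100
    c = [1.0 / n] * n  # bounded differences of a mean of [0, 1] scores
    print(two_sided_tail(0.2, variance_proxy(c)))        # sampling noise only
    print(two_sided_tail(0.2, variance_proxy(c, 0.01)))  # with a latent mixture term
```

The second call uses a made-up value of $\sigma_{\mathrm{mix}}^2$ purely to show how much a non-vanishing mixture term loosens the bound.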

Crucially, for zero-sum linear contrasts - like the difference between a subsample mean and the full population mean - the $\sigma_{\mathrm{mix}}^2$ term cancels exactly. The bound reduces to a mixture-free Hoeffding-type inequality. No hidden dependency on domain imbalance. No need to estimate the mixing distribution.
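One way to see why the mixture term drops out is the following sketch. It assumes each score $X_i$ lies in $[0,1]$ and that, conditional on the de Finetti directing measure $\mu$, the $X_i$ are i.i.d. with mean $\theta(\mu)$. The weights $w_i$ and the symbols $T$, $\theta$, $m$, $n$ are notation introduced here for illustration, not taken from the paper.

$$
T = \sum_{i=1}^{n} w_i X_i, \qquad \sum_{i=1}^{n} w_i = 0
\quad\Longrightarrow\quad
\mathbb{E}[T \mid \mu] = \theta(\mu) \sum_{i=1}^{n} w_i = 0 .
$$

The conditional mean does not depend on $\mu$, so there is no latent fluctuation left to pay for. Applying Hoeffding's inequality conditionally, with bounded-difference constants $c_i = |w_i|$, and then averaging over $\mu$ gives

$$
\mathbb{P}\bigl(|T| \ge t\bigr)
= \mathbb{E}\bigl[\mathbb{P}(|T| \ge t \mid \mu)\bigr]
\le 2\exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n} w_i^2}\right).
$$

For the subsample mean of $m$ items minus the full mean over $n$ items, $w_i = \frac{1}{m} - \frac{1}{n}$ on the subsample and $w_i = -\frac{1}{n}$ elsewhere, so $\sum_i w_i^2 = \frac{1}{m} - \frac{1}{n}$ and the bound becomes $2\exp\!\bigl(-2t^2\, mn/(n-m)\bigr)$. This is the standard argument under the stated assumptions; the paper's constants may differ.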

What This Means for MMLU and Similar Benchmarks

MMLU and its descendants pool questions from many domains (math, history, law, etc.), and the analysis models those questions as exchangeable. When you estimate full-benchmark accuracy from a random subset, the naive bound adds uncertainty for the unknown domain mix. The new result says that this extra uncertainty is an illusion for the difference between subsample accuracy and full-benchmark accuracy. The bound is tight and distribution-free.

The paper explicitly frames this as a direct de Finetti mechanism that recovers the infinite-extendibility limit of recent finite-exchangeable results. It also provides a domain-stratified hierarchical model for bounding accuracy score uncertainty, plus a cost-saving guarantee: you can estimate the full benchmark score from a random subset with a known, uniform tail bound, no matter how the domains are distributed.
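The cost-saving guarantee can be turned into a sample-size calculation. The sketch below assumes scores in $[0,1]$ and a mixture-free Hoeffding-type tail of the form $2\exp(-2t^2/(1/m - 1/n))$ for the gap between the mean of $m$ subsampled items and the mean of all $n$ items, which is what a conditional Hoeffding argument yields. The function names and the benchmark size in the example are hypothetical, and the paper's constants may differ.

```python
import math


def contrast_tail(m, n, t):
    """Assumed tail bound on P(|subsample mean - full mean| >= t)."""
    if m >= n:
        return 0.0  # the subsample is the whole benchmark
    gap = 1.0 / m - 1.0 / n  # sum of squared contrast weights
    return min(1.0, 2.0 * math.exp(-2.0 * t * t / gap))


def subsample_size(n, t, delta):
    """Smallest m for which the assumed bound is at most delta."""
    threshold = 1.0 / n + 2.0 * t * t / math.log(2.0 / delta)
    return min(n, math.ceil(1.0 / threshold))


if __name__ == "__main__":
    n_full = 14000  # hypothetical benchmark size
    m = subsample_size(n_full, t=0.03, delta=0.05)
    print(m, contrast_tail(m, n_full, 0.03))
```

Note that nothing in the calculation refers to the domain proportions, which is the point of the uniform guarantee.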

Why Engineers Should Care

If you're running large-scale evaluations of LLMs on composite benchmarks, you already know how expensive a full benchmark run is. This result gives you a theoretical license to subsample aggressively while still putting a rigorous confidence interval on your estimate. No more hand-waving about domain coverage. The bound holds for any infinitely exchangeable sequence, a model that covers most modern AI benchmarks that draw from heterogeneous sources.

The next step is clear: implement this bound as a drop-in replacement for the naive Hoeffding or empirical Bernstein used in evaluation pipelines. The paper's construction works for any zero-sum contrast, not just means, so the same logic applies to win rates, vote margins, and other pairwise comparisons common in AI eval.
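As a rough illustration of what such a drop-in replacement could look like, the sketch below computes an interval for the full-benchmark score from per-item scores on a subsample. It assumes scores in $[0,1]$ and a half-width of $\sqrt{(1/m - 1/n)\,\ln(2/\delta)/2}$, obtained by inverting a mixture-free Hoeffding-type tail for the subsample-minus-full contrast. The function names are hypothetical and the paper's constants may differ.

```python
import math


def full_score_interval(scores, n_full, delta=0.05):
    """Interval for the full-benchmark mean from a subsample of [0, 1] scores."""
    m = len(scores)
    if not 0 < m <= n_full:
        raise ValueError("need 0 < len(scores) <= n_full")
    estimate = sum(scores) / m
    half_width = math.sqrt((1.0 / m - 1.0 / n_full) * math.log(2.0 / delta) / 2.0)
    return max(0.0, estimate - half_width), estimate, min(1.0, estimate + half_width)


def naive_hoeffding_half_width(m, delta=0.05):
    """Textbook Hoeffding half-width for the mean of m scores in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))


if __name__ == "__main__":
    scores = [1] * 700 + [0] * 300  # made-up per-question correctness
    lo, est, hi = full_score_interval(scores, n_full=10000)
    print(lo, est, hi, naive_hoeffding_half_width(len(scores)))
```

The two half-widths target slightly different quantities: the naive one bounds the distance to a population mean, while the contrast version bounds the distance to the score you would have obtained on the full benchmark, and it shrinks to zero as the subsample approaches the full set.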


Source: Bounded Difference Concentration for Infinitely Exchangeable Sequences with Applications to AI Benchmark Uncertainty
Domain: arxiv.org
