
UnpredictaBench Reveals That LLMs Cannot Simulate True Randomness

A new benchmark of 448 problems shows that LLMs fail to produce samples that match target distributions, with top scores below 40% and many near zero.

unpredictabench, llms, distributional sampling, simulation, kolmogorov smirnov, benchmark

The best LLM on UnpredictaBench’s KS@100 metric can’t even hit 40% – most models hover near zero. That’s the headline from a new evaluation that tests whether large language models can approximate arbitrary probability distributions, a capability that’s critical if we ever want to use them as stand-ins for humans in economic simulations or for stochastic systems in scientific modeling.

What UnpredictaBench Actually Measures

UnpredictaBench isolates a stripped-down version of the problem: given a target distribution, can an LLM generate a set of samples that statistically match it? The benchmark includes 448 tasks spanning three categories: canonical statistical distributions (normal, Poisson, etc.), distributions induced by stochastic programs, and natural-language descriptions of random processes (e.g., “flip a biased coin 500 times”).

Each model gets a score called KS@N – the pass rate of the Kolmogorov-Smirnov test comparing the model’s samples of size N to ground-truth samples. Larger N means harder: with only a few samples the test has little power, so a miscalibrated model can “get lucky” and pass. KS@100 is the standard metric.
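To make the metric concrete, here is a minimal sketch of a KS@N-style pass rate in plain Python. It is not the benchmark's own code. The significance level (alpha = 0.05), the asymptotic critical value, the reference sample size, and the function names ks_statistic, ks_passes and ks_at_n are all assumptions made for illustration.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_passes(model_samples, reference_samples, alpha=0.05):
    """True if the KS test does not reject at level alpha (assumed pass rule)."""
    n, m = len(model_samples), len(reference_samples)
    # Asymptotic critical value for the two-sample KS test.
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(model_samples, reference_samples) <= threshold


def ks_at_n(tasks, n, alpha=0.05):
    """Fraction of (model_samples, reference_samples) tasks that pass at size n."""
    passed = sum(ks_passes(model[:n], ref, alpha) for model, ref in tasks)
    return passed / len(tasks)


# Hypothetical two-task demo: one plausible sampler, one that always says 0.
rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(1000)]
tasks = [
    ([rng.gauss(0, 1) for _ in range(100)], reference),
    ([0.0] * 100, reference),
]
print(f"KS@100 = {ks_at_n(tasks, 100):.2f}")
```

The threshold shrinks as N grows, so a sampler that is slightly off can pass at small N and fail at N = 100, which matches the article's point about larger N being harder.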

Why Distributional Sampling Matters More Than Diversity

Recent work on output diversity or temperature tuning doesn’t cut it here. Simulation doesn’t just need varied outputs; it needs samples that are calibrated to a specific target distribution. A model that always outputs the mean is right on average but has no spread at all, and a model with widely varied outputs can still put its probability mass in the wrong places; both are useless for capturing real-world unpredictability. The UnpredictaBench authors make this distinction explicit: “simulation requires samples that are calibrated to a target distribution, not merely varied outputs.”
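The difference between varied and calibrated can be seen with a one-sample KS distance against a known target. The sketch below is an illustration only, assuming a standard normal target and three hypothetical samplers; normal_cdf and ks_distance are names chosen here, not part of the benchmark.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_distance(samples, cdf):
    """One-sample KS distance: largest gap between the empirical CDF and cdf."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(0)
n = 100
samplers = {
    "always the mean": [0.0] * n,
    "varied but uncalibrated": [rng.uniform(-10, 10) for _ in range(n)],
    "calibrated": [rng.gauss(0, 1) for _ in range(n)],
}
for name, samples in samplers.items():
    print(f"{name}: KS distance = {ks_distance(samples, normal_cdf):.3f}")
```

The constant sampler sits at a distance of exactly 0.5 from the target, and the wide uniform sampler, despite being far more varied, is also far from it. Only the sampler that actually draws from the target gets a small distance.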

No Model Breaks 40% – and Reasoning Hardly Helps

Across open and proprietary models, KS@100 scores range from near 0% to just over 20%, and no model achieves more than 40%, which the authors say leaves “significant headroom.” Adding chain-of-thought reasoning boosts scores modestly but does not close the gap. Even simple distributions like a fair coin flip or a Gaussian with specified variance trip up models in ways that statistical tests catch immediately.

UnpredictaBench sets the bar for a capability that simulation and economics researchers need; without it, using LLMs as proxies for human behavior or stochastic systems is a non-starter.


Source: UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs
Domain: arxiv.org
