
UnpredictaBench Reveals LLMs Can't Simulate True Randomness

A new benchmark of 448 problems shows LLMs fail to produce samples matching target distributions, with top scores below 40% and many near zero.

The best LLM on UnpredictaBench’s KS@100 metric can’t even hit 40% – most models hover near zero. That’s the headline from a new evaluation that tests whether large language models can approximate arbitrary probability distributions, a capability that’s critical if we ever want to use them as stand-ins for humans in economic simulations or for stochastic systems in scientific modeling.

What UnpredictaBench Actually Measures

UnpredictaBench isolates a stripped-down version of the problem: given a target distribution, can an LLM generate a set of samples that statistically match it? The benchmark includes 448 tasks spanning three categories: canonical statistical distributions (normal, Poisson, etc.), distributions induced by stochastic programs, and natural-language descriptions of random processes (e.g., “flip a biased coin 500 times”).
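To make the three categories concrete, here is a minimal Python sketch of how ground-truth samples might be drawn for one task of each kind. The specific tasks (a standard normal, a roll-until-six program, a coin with heads probability 0.3) and all function names are illustrative assumptions, not tasks taken from the benchmark.

```python
import random

# Category 1: a canonical distribution (here, a normal; parameters are assumed).
def sample_normal(rng, mu=0.0, sigma=1.0):
    return rng.gauss(mu, sigma)

# Category 2: a distribution induced by a stochastic program.
# Hypothetical program: roll a die until a six appears, return the roll count.
def sample_rolls_until_six(rng):
    rolls = 1
    while rng.randint(1, 6) != 6:
        rolls += 1
    return rolls

# Category 3: a natural-language process, e.g. "flip a biased coin 500 times".
# The bias of 0.3 is an assumption; one sample is the number of heads.
def sample_biased_coin_heads(rng, p_heads=0.3, flips=500):
    return sum(rng.random() < p_heads for _ in range(flips))

def draw(sampler, n, seed=0):
    """Draw n ground-truth samples from a sampler with a fixed seed."""
    rng = random.Random(seed)
    return [sampler(rng) for _ in range(n)]

if __name__ == "__main__":
    for sampler in (sample_normal, sample_rolls_until_six, sample_biased_coin_heads):
        print(sampler.__name__, draw(sampler, 5))
```

In each case the reference side is easy to produce with an ordinary random number generator; the benchmark asks whether an LLM's own outputs can match it.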

Each model gets a score called KS@N: the pass rate of the Kolmogorov-Smirnov test comparing N samples from the model against ground-truth samples. Larger N is harder, because the test gains power as the sample grows; with a small sample it is easier to “get lucky” and pass. KS@100 is the standard metric.
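A rough sketch of how a KS@N-style score could be computed is shown below. It rests on several assumptions not confirmed by the article: that a task "passes" when the two-sample KS p-value is at least 0.05, that the p-value comes from the standard asymptotic Kolmogorov series, and that the score is the fraction of tasks passed. The function names are hypothetical, and the benchmark's actual procedure may differ.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sample p-value (an approximation, best for larger samples)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the p-value is indistinguishable from 1 in this range
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * total))

def ks_at_n(tasks, n=100, alpha=0.05):
    """Fraction of tasks whose first n model samples pass the KS test.

    tasks: list of (model_samples, truth_samples) pairs.
    alpha: assumed significance threshold for counting a pass.
    """
    passed = 0
    for model_samples, truth_samples in tasks:
        model_n = model_samples[:n]
        d = ks_statistic(model_n, truth_samples)
        if ks_pvalue(d, len(model_n), len(truth_samples)) >= alpha:
            passed += 1
    return passed / len(tasks)
```

Under these assumptions, a model that matches the target on half of its tasks would score 0.5, so a KS@100 near zero means the test rejects the model's samples on almost every task.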

Why Distributional Sampling Matters More Than Diversity

Recent work on output diversity or temperature tuning doesn’t cut it here. Simulation doesn’t just need varied outputs; it needs samples that are calibrated to a specific target distribution. A model that always outputs the mean is right on average but useless for capturing real-world unpredictability, and a model whose outputs vary widely is no better if they don’t follow the target. The UnpredictaBench authors make this distinction explicit: “simulation requires samples that are calibrated to a target distribution, not merely varied outputs.”
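The difference between diversity and calibration can be illustrated with a small experiment. The sketch below compares three hypothetical samplers against a standard normal target: one that always returns the mean, one that is highly varied but drawn from the wrong distribution, and one that is calibrated. The target, the samplers, and the names are all assumptions chosen for illustration; spread is measured by standard deviation and calibration by the one-sample KS distance to the target CDF.

```python
import math
import random
import statistics

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the assumed target distribution, a normal with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(samples, cdf):
    """One-sample KS distance: largest gap between the empirical CDF and cdf."""
    xs = sorted(samples)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(xs))

rng = random.Random(0)
n = 2000
outputs = {
    "always_mean": [0.0] * n,                                   # no diversity
    "varied_but_wrong": [rng.uniform(-10, 10) for _ in range(n)],  # diverse, miscalibrated
    "calibrated": [rng.gauss(0, 1) for _ in range(n)],          # matches the target
}

# Map each sampler to (spread, KS distance to the target).
report = {name: (statistics.pstdev(xs), ks_distance(xs, normal_cdf))
          for name, xs in outputs.items()}

for name, (spread, dist) in report.items():
    print(f"{name:18s} spread={spread:5.2f}  KS distance={dist:.3f}")
```

The varied-but-wrong sampler has the largest spread of the three, yet it sits far from the target, which is why a diversity measure alone cannot stand in for a distributional test.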

No Model Breaks 40% – and Reasoning Hardly Helps

Across open and proprietary models, scores vary widely but stay low. On KS@100 they start near 0%, and no model achieves more than 40%, which the authors say leaves “significant headroom.” Adding chain-of-thought reasoning boosts scores modestly but does not close the gap. Even simple distributions like a fair coin flip or a Gaussian with specified variance trip up models in ways that statistical tests catch immediately.

UnpredictaBench sets the bar for a capability that simulation and economics researchers need; without it, using LLMs as proxies for human behavior or stochastic systems is a non-starter.


Source: UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs
Domain: arxiv.org
