4,292 real Reddit threads are the baseline for MiroBench, and every LLM simulator tested falls short of matching them. That's the blunt finding from a new benchmark that measures whether LLM agents reproduce not just fluent text but the actual distributional patterns of human discussion.
What MiroBench Actually Tests
Reddit threads are a solid proxy for real-world social interaction: topic-grounded, multi-party, full of debate, advice, emotion, and toxicity. MiroBench exploits that by building a corpus from 4,292 real threads and then running statistical tests to compare generated discussions against real ones across four axes: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity.
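To make one of those axes concrete, here is a minimal sketch of a per-thread repetition score. It is not MiroBench's own metric, which this summary doesn't spell out; the function name `distinct_ngram_ratio`, the whitespace tokenization, and the toy threads are all assumptions chosen for illustration.

```python
from collections import Counter


def distinct_ngram_ratio(comments, n=2):
    """Share of n-grams in a thread that are unique (1.0 means no repetition).

    Hypothetical helper for illustration; not taken from MiroBench.
    """
    counts = Counter()
    for comment in comments:
        tokens = comment.lower().split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return 1.0  # nothing to repeat in an empty or very short thread
    return len(counts) / total


# Made-up toy threads, not data from the benchmark.
varied_thread = ["the fix broke my build", "works here, which version are you on?"]
repetitive_thread = ["great point, thanks for sharing", "great point, thanks for sharing"]

print(distinct_ngram_ratio(varied_thread))
print(distinct_ngram_ratio(repetitive_thread))
```

Computing a score like this for every real thread and every generated thread gives two samples per domain, and those samples are what a statistical test would then compare.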
Five domains and five models go through the wringer. The authors don't name the models in the abstract, but the pattern is clear: every simulator produces discussions that are distributionally mismatched with the real threads. That's not a judgment about fluency; it's a measurable gap in the statistics that define how real people actually behave.
The Numbers That Matter
4,292 threads is a substantial body of ground truth. MiroBench doesn't rely on human raters or subjective quality scores; it uses statistical tests to detect mismatch. That makes it diagnostic: you can see which axis a simulator fails on. Is the output too repetitive and semantically uniform? Too low on toxic outbursts? Too sanitized in its narrative content? Too simple in structure? The benchmark forces you to confront the gap.
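What "a statistical test for mismatch" can look like in practice: the sketch below runs a two-sample Kolmogorov-Smirnov statistic with a permutation p-value over any per-thread metric. This summary doesn't say which tests MiroBench uses, so treat the choice of test, the function names `ks_statistic` and `permutation_p_value`, and the synthetic scores as assumptions.

```python
import bisect
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def permutation_p_value(real, simulated, n_permutations=2000, seed=0):
    """Return (observed KS statistic, permutation p-value).

    Hypothetical helper: shuffles the pooled scores to estimate how often
    a gap at least as large as the observed one arises by chance.
    """
    rng = random.Random(seed)
    observed = ks_statistic(real, simulated)
    pooled = list(real) + list(simulated)
    n_real = len(real)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_real], pooled[n_real:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Synthetic per-thread scores, invented for illustration only.
real_scores = [0.1 * i for i in range(20)]
sim_scores = [5 + 0.1 * i for i in range(20)]

statistic, p_value = permutation_p_value(real_scores, sim_scores)
print(statistic, p_value)
```

A small p-value says the simulated scores are unlikely to come from the same distribution as the real ones on that metric. Running one such test per metric and per domain is what makes the result diagnostic rather than a single pass/fail score.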
And the gap is real. Even after the authors applied a lightweight prompt-based improvement procedure, gains were limited. You can't just tell an LLM to "be more realistic" and expect it to match the distribution of thousands of human commenters.
Why This Matters for the Simulation Crowd
LLM agents are being deployed to simulate market behavior, political discourse, and customer interactions. If those simulations don't preserve the statistical signature of real human interaction, they're not simulations; they're roleplays that happen to look plausible. MiroBench provides a concrete, repeatable way to measure that gap before you trust your agent-based model's conclusions.
For now, the takeaway is simple: current social simulators are distributionally off, and a better system prompt alone doesn't fix that. MiroBench gives engineers a benchmark to chase, which is a first real step toward closing the gap.
Source: MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions
Domain: arxiv.org