Know2Guess: The Benchmark That Catches LLMs Guessing When They Should Abstain

A new 1,200-item benchmark across five domains measures how well LLMs can distinguish answerable questions from those they should refuse, with Qwen2.5-3B-Instruct achieving the best reliability.

Tags: know2guess, llm evaluation, abstention, benchmark, qwen25, data contamination

Qwen2.5-3B-Instruct beats FLAN-T5 and Llama-3-Instruct on a new benchmark designed to measure exactly where an LLM's knowledge stops and pure guesswork begins. That benchmark, called Know2Guess, packs 1,200 items across five domains with explicit abstention expectations and contamination-risk metadata. It doesn't just ask models to answer correctly - it penalizes them for guessing when they should say "I don't know."

What Know2Guess Actually Measures

Most evaluation benchmarks reward only correct answers. Know2Guess flips that by creating three zones: answer-expected, abstain-expected, and a transition zone between them. Each item comes with a frozen build-time label that tells the evaluator whether the model should answer or abstain. The authors also tag contamination risk - whether the item might appear in training data - so you can separate genuine knowledge from memorization.
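To make the labeling scheme concrete, here is a minimal sketch of what one item record and its outcome check might look like. The field names, zone strings, risk values, and outcome names are all assumptions made for illustration; the article does not describe the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical item record; every field name here is an assumption."""
    item_id: str
    domain: str
    question: str
    zone: str                   # "answer_expected", "transition", or "abstain_expected"
    expected: str               # frozen build-time label: "answer" or "abstain"
    gold_answer: Optional[str]  # None when abstention is expected
    contamination_risk: str     # illustrative values such as "low" or "high"


def classify_outcome(item: Item, abstained: bool, answer_correct: bool = False) -> str:
    """Compare the model's behavior against the item's frozen label."""
    if item.expected == "abstain":
        return "appropriate_abstention" if abstained else "failed_to_abstain"
    if abstained:
        # The model declined an item it was expected to answer.
        return "unnecessary_refusal"
    return "correct_answer" if answer_correct else "wrong_answer"
```

Because the label is fixed when the item is built, the evaluator never has to decide after the fact whether an abstention was justified; it only compares behavior against the stored label.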

The benchmark uses two parsers: a strict parser and a normalized robustness parser. Prompt templates vary across runs to catch prompt idiosyncrasy. The result is a protocol that isolates four distinct failure modes: inability to answer, inability to abstain, refusal of benign items, and contamination-driven overconfidence.
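The difference between a strict and a normalized parser can be sketched roughly as follows. This is an illustration, not the benchmark's implementation: the abstention string, the normalization rules, and the function names are all assumptions.

```python
import re

# Hypothetical canonical abstention string; the benchmark's real one is not given here.
ABSTAIN_TOKEN = "I don't know"


def normalize(text: str) -> str:
    """Lowercase, unify apostrophes, drop punctuation, collapse whitespace."""
    text = text.lower().replace("\u2019", "'")
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def strict_is_abstention(response: str) -> bool:
    """Strict: only the exact abstention string (after trimming) counts."""
    return response.strip() == ABSTAIN_TOKEN


def normalized_is_abstention(response: str) -> bool:
    """Normalized: the abstention phrase may appear anywhere, in any casing."""
    return normalize(ABSTAIN_TOKEN) in normalize(response)


def strict_match(response: str, gold: str) -> bool:
    """Strict: the trimmed response must equal the gold answer exactly."""
    return response.strip() == gold


def normalized_match(response: str, gold: str) -> bool:
    """Normalized: compare after removing casing and punctuation differences."""
    return normalize(response) == normalize(gold)
```

Scoring each response with both parsers shows how much of a model's apparent failure comes from formatting rather than from the answer itself: a response such as "Paris." fails the strict match against "Paris" but passes the normalized one.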

The Model That Does Best (But Still Isn't Great)

Qwen2.5-3B-Instruct achieves the best overall reliability among the tested models, which include FLAN-T5, Qwen2.5-Instruct variants, and Llama-3-Instruct. The FLAN baselines remain weak on productive abstention: they tend either to answer everything or to refuse everything, missing the nuance entirely. Stronger instruction-tuned models show a selective but incomplete transition from answering to abstaining.

Here's the kicker: even for the best model, Qwen2.5-3B-Instruct, answer-expected zones remain difficult. Calibration is poor: the model's confidence doesn't align with its actual correctness. And benign-item refusal persists: models sometimes say "I don't know" to questions they can demonstrably answer. That's a reliability gap you can't fix with a prompt tweak.
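One common way to quantify the gap between confidence and correctness is expected calibration error (ECE). The article does not say which calibration metric the authors report, so the sketch below is only an illustration of the idea; the function name, the equal-width binning, and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 95% confidence on four answers but gets only one right has an ECE of about 0.70 under this definition, while a model whose confidence matches its accuracy in every bin scores 0.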

Why Contamination Metadata Matters

Most LLM benchmarks score every question the same way, regardless of whether it could have leaked into training data. Know2Guess records contamination risk for each item, letting you check whether a model's correct answers stem from genuine knowledge or from having seen the same text during training. If a model scores high on high-risk items but collapses on clean ones, that points to memorization rather than knowledge.
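The comparison described here amounts to splitting accuracy by contamination tag. A minimal sketch, assuming each scored result is a (risk, is_correct) pair and that risk is tagged "high" or "low" (both are assumptions; the article does not list the actual tag values):

```python
from collections import defaultdict


def accuracy_by_contamination(records):
    """records: iterable of (contamination_risk, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for risk, ok in records:
        totals[risk] += 1
        hits[risk] += int(ok)
    return {risk: hits[risk] / totals[risk] for risk in totals}


def contamination_gap(records, high="high", low="low"):
    """Accuracy on high-risk items minus accuracy on low-risk items."""
    accuracy = accuracy_by_contamination(records)
    return accuracy[high] - accuracy[low]
```

A large positive gap is a warning sign rather than proof: high-risk items may also simply be easier, so the gap is best read alongside per-domain and per-zone results.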

Prompt and parser robustness analyses preserve the main ranking, so the findings aren't an artifact of formatting. The dataset is public on GitHub (github.com/renweimeng/Know2Guess-A-Contamination-Aware-Multi-Zone-Benchmark), which means anyone can reproduce the protocol or extend it to new models.

Know2Guess forces evaluators to think about LLM reliability as a set of interacting dimensions - answerability, abstention, refusal, contamination - not a single score. The next step is to build models that don't just answer correctly but know the shape of their own ignorance.


Source: Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models
Domain: arxiv.org
