Contrastive Reflection Lifts Prompt Accuracy from 51.4% to 60.4% on HotpotQA

One tree-selected contrastive repair pushed held-out exact-match accuracy from 51.4% to 60.4% on HotpotQA — a 9-point gain without regression checks.

That’s not a tuning trick. It’s the result of Contrastive Reflection, a prompt-optimization framework that makes LLM agent debugging look like real engineering: identify which behavior failed, find a nearby success, and ask a Teacher LLM to write a targeted edit that fixes only what’s broken.

What Makes Contrastive Reflection Different

Most prompt optimizers treat the problem as black-box search — mutate the prompt, test on a held-out set, repeat. Contrastive Reflection opens the box. It uses structured traces from QA agents (retrieval and reasoning steps) and grading agents (dimension-level scores and rationales) to carve out error-anchored behavioral slices. These are specific regions of the input space where the agent consistently fails. The framework then adds nearby successful examples from the same slice, creating a contrastive pair that tells the Teacher LLM what change actually matters.
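To make the pairing step concrete, here is a minimal sketch of how a contrastive pair might be pulled from one slice. It is an illustration under stated assumptions, not the paper's implementation: the `Trace` fields, the numeric `features` tuple, the slice labels, and the squared-distance notion of "nearby" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    question: str
    slice_id: str    # hypothetical label for the behavioral slice
    features: tuple  # hypothetical trace features, e.g. (hops, docs_retrieved)
    correct: bool


def distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_pair(traces, slice_id):
    """Return (failure, nearest success) from one slice, or None if no pair exists."""
    in_slice = [t for t in traces if t.slice_id == slice_id]
    failures = [t for t in in_slice if not t.correct]
    successes = [t for t in in_slice if t.correct]
    if not failures or not successes:
        return None
    failure = failures[0]  # assumption: anchor on the first failure found
    nearest = min(successes, key=lambda s: distance(s.features, failure.features))
    return failure, nearest


traces = [
    Trace("q1", "multi_hop", (2, 5), False),
    Trace("q2", "multi_hop", (2, 4), True),
    Trace("q3", "multi_hop", (4, 9), True),
    Trace("q4", "single_hop", (1, 2), True),
]
pair = contrastive_pair(traces, "multi_hop")
```

Here the failure `q1` is paired with `q2`, the success whose trace features are closest to it within the same slice. A slice with no failures, or with no successes, yields no pair.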

Failure-only or random-evidence variants of the method improve less and break more previously correct examples. That’s the key insight: in these experiments, contrastive evidence consistently beat raw failure data.

How the Loop Works

Starting from a task-centric quality definition, the framework uses a tree-based slice selector to cluster examples into regions. For each region with high error density, it pulls in a successful example from the same region and feeds both to a Teacher LLM, which proposes a prompt edit. That candidate is accepted only when validation performance improves — optionally subject to regression checks that ensure you don’t fix one slice while breaking another.
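The selection and acceptance steps described above can be sketched as follows. This is a simplified illustration, not the framework's code: the slices are given as precomputed labels instead of coming from a tree-based selector, and the function names and the `max_regressions` parameter are hypothetical. It assumes per-example correctness lists for the same validation set before and after the candidate edit.

```python
from collections import defaultdict


def error_rates(records):
    """Map each slice id to its error rate, given (slice_id, correct) records."""
    totals, errors = defaultdict(int), defaultdict(int)
    for slice_id, correct in records:
        totals[slice_id] += 1
        errors[slice_id] += 0 if correct else 1
    return {s: errors[s] / totals[s] for s in totals}


def regressions(before, after):
    """Count examples that were correct before the edit and wrong after it."""
    return sum(1 for b, a in zip(before, after) if b and not a)


def accept(before, after, max_regressions=None):
    """Accept only if validation accuracy improves; optionally cap regressions."""
    if sum(after) <= sum(before):  # same validation set, so counts compare directly
        return False
    if max_regressions is not None:
        return regressions(before, after) <= max_regressions
    return True


records = [
    ("multi_hop", False), ("multi_hop", False), ("multi_hop", True),
    ("single_hop", True), ("single_hop", False), ("single_hop", True),
]
rates = error_rates(records)
target = max(rates, key=rates.get)  # slice with the highest error density

before = [True, False, False, True]
after = [False, True, True, True]  # net gain, but one previously correct example breaks
```

With these toy values the candidate is accepted on accuracy alone, since it gets 3 of 4 correct against 2 of 4, but rejected once a zero-regression check is switched on, because it breaks one example that used to pass.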

The loop is inspectable. You can see which behavioral slice triggered the edit, what the contrastive pair looked like, and what the Teacher LLM changed. Compare that to MIPROv2, which optimizes via Bayesian search over prompt components with no notion of where errors live.

Why It Beats Baselines

On a public HotpotQA retrieval-augmented QA setup, Contrastive Reflection reached 60.4% exact match, against 59.4% for MIPROv2 and 57.0% for GEPA. That light, instruction-only comparison puts it at or near the top of modern optimizers, 1.0 point ahead of MIPROv2 and 3.4 points ahead of GEPA, without requiring expensive LLM calls for every candidate. The tree-based slice selector is a means to an end; the real contribution is the contrastive reflection loop itself.

Failure-only variants improved less and broke more previously correct examples, which suggests that blind failure feedback isn’t enough. You need the contrast. You need to see what working looks like near the same failure mode.

The next step is applying the same loop beyond QA — to multi-turn agents and tool-use pipelines where repair traces are even richer, and where the cost of a bad prompt is a cascade of wrong actions.

Source: Contrastive Reflection for Iterative Prompt Optimization
Domain: arxiv.org
