KG Triples Steal 2-3x More LLM Attention Than Plain Text, Even When Irrelevant

Even irrelevant knowledge graph triples command 2-3x more attention per token than natural language, compressing demonstration attention by up to 42% across Mistral and LLaMA models.

structural attention tax, rag, knowledge graphs, in context learning, mistral 7b, llama 3 8b

Knowledge graph triples capture 2–3× more attention per token than semantically equivalent natural language text, even when the triples are complete noise. That's the central finding from a new formal analysis of retrieval-augmented generation (RAG) that isolates format from content.

The Structural Attention Tax: 0.70 vs 0.25 per Token

Working with Mistral-7B and LLaMA-3-8B across three QA benchmarks, the authors decompose attention scores into semantic and structural components. KG triples, with their relational delimiters and repeated slot patterns, score roughly 0.70 attention per token against 0.25 for neutral natural-language text, a ratio of about 2.8×. This effect compresses the attention paid to in-context demonstrations by up to 42%, whether the triples are relevant or noise. That's the structural attention tax: format claims a share of the model's limited attention budget before content even gets a vote.
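To make the compression effect concrete, here is a minimal toy sketch. It is not the paper's formal model. It assumes attention is split in proportion to per-token weight times token count, treats the 0.70 and 0.25 figures above as relative per-token weights, and gives demonstration tokens the neutral weight. The token counts and function names are hypothetical.

```python
def demo_attention_share(n_demo, n_retrieved, w_demo, w_retrieved):
    """Fraction of total attention landing on demonstration tokens.

    Assumes a proportional-share model (illustrative only): each segment
    receives mass equal to its token count times its per-token weight.
    """
    demo_mass = n_demo * w_demo
    retrieved_mass = n_retrieved * w_retrieved
    return demo_mass / (demo_mass + retrieved_mass)


def relative_compression(share_before, share_after):
    """Relative drop in the demonstration attention share."""
    return (share_before - share_after) / share_before


# Hypothetical prompt: 200 demonstration tokens, 100 retrieved tokens.
N_DEMO, N_RETRIEVED = 200, 100
W_NEUTRAL, W_TRIPLE = 0.25, 0.70  # per-token figures quoted in the article

# Same retrieved length, two formats: plain text vs. KG triples.
share_text = demo_attention_share(N_DEMO, N_RETRIEVED, W_NEUTRAL, W_NEUTRAL)
share_kg = demo_attention_share(N_DEMO, N_RETRIEVED, W_NEUTRAL, W_TRIPLE)

print(f"demo share with plain-text retrieval: {share_text:.3f}")
print(f"demo share with KG-triple retrieval:  {share_kg:.3f}")
print(f"relative compression: {relative_compression(share_text, share_kg):.1%}")
```

In this toy setup the demonstration share shrinks as the retrieved block's per-token weight rises, regardless of what the retrieved tokens say. The size of the drop depends entirely on the assumed token counts, so it should not be read as a reproduction of the reported 42% figure.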

Task-Matched Retrieval Dominates—But Format Still Bites

Source-task alignment still rules overall performance: BM25 retrieval on the matching corpus achieves 58–62% on HotpotQA, while retrieval from ConceptNet, with the same model and gating strategy, drops to 25–27%. That gap of more than 30 percentage points dwarfs the difference between gating strategies (≤2 pp). But within a fixed retrieval source, the structural tax persists. The paper derives a formal compression bound (Proposition 1) linking token-level format bias to demonstration attention loss, and shows that the structural term governs how much attention is diverted while the semantic term governs whether that diversion helps or hurts.

Five Mitigation Strategies, From Zero-Cost to Training-Time

The framework yields five structure-aware mitigations, ranging from zero-cost prompt modifications to training-time regularisation. Format flattening (S3), which rewrites triples as verbalized sentences, is validated by both accuracy and attention-level evidence. Structural dispersal (S1) produces mixed results, which illustrates how difficult format-level intervention can be. The key insight: RAG optimisation now has two orthogonal axes, semantic (what to retrieve) and structural (how to present it). If you're building RAG pipelines, the format of your retrieval chunks is a first-class concern alongside what you retrieve.
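As a rough illustration of what format flattening could look like in a pipeline, the sketch below rewrites (subject, relation, object) triples as plain sentences before they go into the prompt. It is a minimal example under stated assumptions, not the paper's S3 procedure: the template table, the fallback heuristic, and the function names are all hypothetical.

```python
import re

# Hypothetical hand-written templates for a few relation names.
TEMPLATES = {
    "IsA": "{s} is a kind of {o}.",
    "UsedFor": "{s} is used for {o}.",
    "capital_of": "{s} is the capital of {o}.",
}


def verbalize_triple(s, relation, o):
    """Turn one (subject, relation, object) triple into a sentence."""
    template = TEMPLATES.get(relation)
    if template is not None:
        return template.format(s=s, o=o)
    # Fallback heuristic: split CamelCase and snake_case into lowercase words.
    words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", relation)
    words = words.replace("_", " ").lower()
    return f"{s} {words} {o}."


def flatten_triples(triples):
    """Join verbalized triples into one passage of running text."""
    return " ".join(verbalize_triple(*t) for t in triples)


triples = [
    ("dog", "IsA", "animal"),
    ("Paris", "capital_of", "France"),
    ("wheel", "PartOf", "car"),  # no template, uses the fallback
]
print(flatten_triples(triples))
```

The fallback produces rough but delimiter-free text for relations without a template. A production pipeline would likely want fuller templates or a learned verbalizer.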


Source: The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content
Domain: arxiv.org
