
ProvenanceGuard cuts LLM agent misalignment detection error to 1.8% on Agent-SafetyBench

A new provenance-based framework cuts misalignment detection error from 42.9% to 1.8% on Agent-SafetyBench, beating LLM-as-a-judge baselines by a wide margin.

provenanceguard, llm agents, tool use safety, agent safetybench, workbench, large language models

ProvenanceGuard cuts misalignment detection error from 42.9% to 1.8% on Agent-SafetyBench, a drop of 41.1 percentage points relative to the LLM-as-a-judge baseline.

Existing runtime guardrails for LLM agents that invoke external tools rely on a second LLM to judge alignment. That approach produces inconsistent, hard-to-audit verdicts. The paper's authors observed that an agent's context already contains traceable evidence — conversational history, user instructions, tool outputs — so why not check that a proposed tool call is supported by that evidence? That's provenance analysis.

Three Misalignment Types, One Pipeline

ProvenanceGuard formalizes misalignment as missing or contradictory evidence in the agent's provenance chain. The pipeline checks for three distinct failure modes before any tool executes: unsupported tool choice, unsupported arguments, and unsupported timing or sequence. Only if the call passes all three checks does it go through.

This is not an LLM judge in the usual sense. It's a structured, multi-stage reasoning system that compares the agent's intended action against a structured representation of its context — think of it as a compiler that verifies type safety, but for tool invocations.
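To make the three checks concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the article describes a structured, multi-stage reasoning system, while this sketch substitutes simple set membership and substring matching. Every name in it (ToolCall, Context, guard, the check functions, and the prerequisites mapping) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Context:
    # Hypothetical structured view of the agent's context (an assumption,
    # not the representation used in the paper).
    requested_tools: set   # tools that the user's instructions justify
    evidence: list         # text snippets: history, instructions, tool outputs
    executed: list = field(default_factory=list)       # tool names already run
    prerequisites: dict = field(default_factory=dict)  # tool -> tools required first


def check_tool_choice(call, ctx):
    """Unsupported tool choice: nothing in the context justifies this tool."""
    return call.name in ctx.requested_tools


def check_arguments(call, ctx):
    """Unsupported arguments: some value cannot be traced to any evidence."""
    return all(
        any(str(value) in snippet for snippet in ctx.evidence)
        for value in call.args.values()
    )


def check_sequence(call, ctx):
    """Unsupported timing or sequence: a required earlier step has not run."""
    return all(step in ctx.executed for step in ctx.prerequisites.get(call.name, []))


def guard(call, ctx):
    """Return (allowed, failed_checks); the call runs only if every check passes."""
    checks = {
        "tool_choice": check_tool_choice,
        "arguments": check_arguments,
        "sequence": check_sequence,
    }
    failed = [name for name, check in checks.items() if not check(call, ctx)]
    return (not failed, failed)


if __name__ == "__main__":
    ctx = Context(
        requested_tools={"send_email"},
        evidence=["User: email the report to ana@example.com"],
        prerequisites={"send_email": ["draft_email"]},
    )
    call = ToolCall("send_email", {"to": "ana@example.com"})
    print(guard(call, ctx))  # blocked: the draft step has not run yet
    ctx.executed.append("draft_email")
    print(guard(call, ctx))  # allowed: all three checks pass
```

Returning the names of the failed checks, and not only a yes-or-no verdict, is what makes the decision auditable: a reviewer can see which piece of evidence was missing.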

Hard Numbers Across Ten Backbones

ProvenanceGuard was evaluated on Agent-SafetyBench and WorkBench across 10 different backbone LLMs. Its error rate on misaligned traces dropped from 42.9% to 1.8% on the former and from 32.1% to 17.3% on the latter. Equally important, the intervention burden on task-successful aligned traces fell from 30.5% to 12.8%, with no statistically significant increase in false positives.
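The reported figures mix large and small starting points, so it helps to view each change both in percentage points and as a relative reduction. The short Python sketch below does only that arithmetic on the numbers quoted above; the function name and labels are illustrative, and nothing here comes from the paper's code.

```python
# Figures quoted in the article, as (before %, after %).
results = {
    "Agent-SafetyBench error on misaligned traces": (42.9, 1.8),
    "WorkBench error on misaligned traces": (32.1, 17.3),
    "Intervention burden on aligned traces": (30.5, 12.8),
}


def reductions(before, after):
    """Return (percentage-point drop, relative reduction in %), rounded to 1 decimal."""
    drop = before - after
    return round(drop, 1), round(100 * drop / before, 1)


for label, (before, after) in results.items():
    points, relative = reductions(before, after)
    print(f"{label}: {before}% -> {after}% ({points} points, {relative}% relative)")
```

This gives drops of 41.1, 14.8, and 17.7 percentage points, or roughly 95.8%, 46.1%, and 58.0% in relative terms, which shows that the gain on WorkBench is real but much smaller than on Agent-SafetyBench.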

That means the framework blocks legitimate actions far less often while catching nearly all misaligned ones on Agent-SafetyBench and most of them on WorkBench. For engineers deploying agents that execute code, send emails, or modify files, that's the difference between a guardrail you trust and one you ignore.

ProvenanceGuard's approach suggests a future where agent safety isn't an opaque LLM popularity contest, but a deterministic audit of intent against evidence.


Source: Safeguarding LLM Agents from Misalignment through Provenance Analysis
Domain: arxiv.org
