The authors formalize evaluation as policy-grounded correctness, introducing the Defensibility Index (DI) and Ambiguity Index (AI) to measure reasoning stability without additional audit passes. The Probabilistic Defensibility Signal (PDS) is derived from audit-model token logprobs and is used to verify whether a proposed decision is logically derivable from the governing rule hierarchy. The authors validate the framework on more than 193,000 Reddit moderation decisions across multiple communities and evaluation cohorts, finding a significant gap between agreement-based and policy-grounded metrics. They further show that measured ambiguity is driven by rule specificity, and that a Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. This preprint is relevant to Principal Engineers, CISOs, ML Researchers, and Technical Founders because it provides a novel evaluation framework for rule-governed AI systems.
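The summary states that the PDS is derived from audit-model token logprobs and that a Governance Gate routes decisions based on such signals. The paper's exact formulas are not given here, so the sketch below is only illustrative: it aggregates token logprobs into a length-normalized sequence probability (a common confidence heuristic, assumed rather than taken from the paper) and applies a hypothetical threshold to decide between automation and human review.

```python
import math

def pds_from_logprobs(token_logprobs):
    """Illustrative defensibility signal: length-normalized sequence
    probability computed from audit-model token logprobs.
    (Assumed aggregation; the paper's actual PDS definition may differ.)"""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)  # maps to (0, 1]

def governance_gate(pds, threshold=0.9):
    """Hypothetical gate: automate only when the signal clears the
    threshold; otherwise route the decision to human review."""
    return "automate" if pds >= threshold else "human_review"

signal = pds_from_logprobs([-0.05, -0.02, -0.10])
print(round(signal, 3), governance_gate(signal))
```

The threshold value and the length normalization are design choices introduced for this sketch; in practice they would be calibrated against the defensibility and risk-reduction targets the paper reports.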
Source: Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI