Citations Break LLM Code Determinism but Unlock 86% Hallucination Detection

Per-line requirement citations reduce LLM code determinism (d≈-0.74) but enable automated hallucination detection at 86-88% TDR with 0% FPR across both models tested.

Tags: arxiv, traceSDD, spec driven development, LLM code generation, hallucination detection, Claude Sonnet

Forcing LLMs to cite a requirement for every line of code they generate lowers output determinism by a Cohen's d of about -0.74. In these studies it was also the only approach that caught hallucinations automatically, with detection rates of 86-88% at 0% false positives.

Two pre-registered studies compared three Spec-Driven Development frameworks: traceSDD (mandatory per-line requirement citations), Spec Kit (artifact-level traceability), and OpenSpec (post-hoc external trace maps). They were run on Claude Sonnet 4.6 (N=20, 240 implementations) and GLM-5-turbo (N=50, 600 implementations). The finding is a clean trade-off: citations reduce output determinism, measured as lexical similarity across independent sessions, but they enable automated hallucination detection that neither of the other two frameworks provides.
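To make "lexical similarity across independent sessions" concrete, here is a minimal sketch of one way such a determinism score could be computed. The function name `pairwise_determinism` and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration; the summary does not say which lexical metric the paper uses.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def pairwise_determinism(outputs: list[str]) -> float:
    """Mean pairwise similarity ratio over independently generated outputs.

    Hypothetical helper: 1.0 means every session produced identical text,
    0.0 means no two outputs share any matching characters.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    ratios = [
        # autojunk=False keeps the ratio stable on long inputs
        SequenceMatcher(None, a, b, autojunk=False).ratio()
        for a, b in combinations(outputs, 2)
    ]
    return mean(ratios)
```

A score like this would be computed once per task and condition, giving the per-condition samples that the effect sizes in the article compare.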

The Determinism–Verifiability Trade-Off

The uncited condition produces significantly higher determinism than the cited condition in both models (Claude: d=-0.76, p=0.003; GLM: d=-0.72, p<0.001). Yet that determinism comes with a blind spot: automated hallucination detection is zero for any framework that doesn't enforce per-line citations. traceSDD's cited condition achieves a True Detection Rate (TDR) of 86.4% on Claude and 88.0% on GLM, while Spec Kit and OpenSpec both yield 0%. The False Positive Rate (FPR) is 0% across both studies. Among the frameworks tested, none delivers both the highest determinism and automated verifiability.
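The reason per-line citations make detection automatable can be sketched in a few lines: if every code line must name a requirement, any line that names none, or names one that is not in the spec, can be flagged mechanically. The sketch below is illustrative only. The `# REQ-n` trailing-comment format, the function names, and the exact TDR/FPR definitions are assumptions, not traceSDD's actual syntax or the paper's scoring procedure.

```python
import re

# Hypothetical citation format: a trailing comment such as "# REQ-3".
CITATION = re.compile(r"#\s*(REQ-\d+)\s*$")

# Hypothetical generated code; REQ-9 is not part of the spec below.
SAMPLE = """\
def total(items):  # REQ-1
    s = sum(items)  # REQ-1
    log_to_server(s)  # REQ-9
    return s  # REQ-2
"""


def flag_lines(code: str, valid_ids: set[str]) -> set[int]:
    """Return 1-based numbers of code lines with a missing or unknown citation."""
    flagged = set()
    for number, line in enumerate(code.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and pure comments are not checked
        match = CITATION.search(stripped)
        if match is None or match.group(1) not in valid_ids:
            flagged.add(number)
    return flagged


def detection_rates(
    flagged: set[int], hallucinated: set[int], code_lines: set[int]
) -> tuple[float, float]:
    """Assumed definitions: TDR = flagged hallucinated lines / all hallucinated
    lines; FPR = flagged clean lines / all clean lines."""
    clean = code_lines - hallucinated
    tdr = len(flagged & hallucinated) / len(hallucinated) if hallucinated else 0.0
    fpr = len(flagged & clean) / len(clean) if clean else 0.0
    return tdr, fpr
```

With the spec `{"REQ-1", "REQ-2"}`, `flag_lines(SAMPLE, ...)` flags only line 3. A framework that records traceability per artifact or in an external map gives a checker like this nothing to inspect at the line level, which is consistent with the 0% detection reported for Spec Kit and OpenSpec.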

traceSDD Beats Spec Kit But Not OpenSpec on Determinism

When the frameworks are compared head-to-head, traceSDD is significantly more deterministic than Spec Kit (Claude: d=0.47, p=0.049; GLM: d=0.42, p=0.003). Against OpenSpec, however, the advantage is not statistically significant (Claude: d=0.18, p=0.44; GLM: d=0.14, p=0.32). That doesn't make OpenSpec the better choice: its post-hoc trace maps cost less determinism than Spec Kit's artifact-level approach, but they also contribute nothing to hallucination detection.
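For readers who want to interpret the d values quoted here, the following is a minimal sketch of Cohen's d using the standard pooled-standard-deviation form for two independent samples. Whether the paper uses this variant or a paired one is not stated in this summary, so treat the choice as an assumption. The function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d with a pooled standard deviation (independent samples).

    A negative result means group_a scored lower on average than group_b.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), the traceSDD versus Spec Kit gaps are close to medium, while the traceSDD versus OpenSpec gaps fall below the small threshold.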

What This Means for Production Code Generation

If you're generating code with LLMs and care about runtime correctness, you need a way to verify that each generated statement fulfills a requirement. traceSDD's citation discipline gives you that, at the cost of less predictable output across sessions. The effect size is medium to large (d≈-0.74) and consistent across two different models, which suggests the cost comes from the annotation mechanism itself rather than from a Claude or GLM quirk. For teams shipping LLM-generated code, traceSDD offers a verifiable path forward, provided they're willing to pay the determinism tax.


Source: Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code
Domain: arxiv.org
