Source linked

LLMs كتابة بيانات الأجهزة التي تفقد 30٪ من الخرافات و نصف الصلب الرسمي

ويظهر نموذج HierSVA أن على الرغم من أن LLMs تنتج تصريحات صحيحة على الأرجح في 82.1٪ من الوقت ، فإن هذه التصريحات تكتشف فقط 70.2٪ من الأخطاء المضادة وتغطي فقط 36.2٪ من القوة الرسمية.

hiersvallmhardware formal verificationsystemverilog assertionsbasejump stlarxiv

Twelve LLMs, when asked to write SystemVerilog Assertions for hierarchical RTL modules, produce assertions that prove non-vacuously 82.1% of the time - yet those same assertion sets catch only 70.2% of injected faults and cover a mere 36.2% of the formal core. That gap between "proves" and "catches bugs" is the kind of metric that separates a demo from a production verification flow.

What 342 Modules and 12 LLMs Actually Reveal

HierSVA, a new benchmark suite from anonymous authors (code on GitHub, dataset on Hugging Face), decomposes assertion quality into six axes: syntax correctness, proof success rate, vacuity, specification faithfulness, mutation coverage, and formal core coverage. They built a dataset of 342 modules from BaseJump STL with hierarchy depths 0-9, plus a deep subset of 28 module-bug pairs with natural-language specs and bug variants.

Testing twelve recent LLMs on the full set yields a module-level compile rate of 67.1%. Among assertions that compile and run, 82.1% prove non-vacuously. But the real story is in the downstream metrics: those generated assertion sets detect only 70.2% of eligible injected faults and cover 36.2% of the formal core. In the deep subset (211 evaluable model-module entries), assertion sets flag buggy RTL with 0.87 recall, but 40% of predicted-buggy outcomes are false positives on correct RTL, pinning precision at 0.60.

Agentic Mode: Gains That Fade

Switching LLMs to agentic mode - letting them iterate, call tools, and retry - improves S1-style provability and strength metrics. But the improvement plateaus and oscillates. You don't get a monotonic climb toward perfect assertions; you get diminishing returns that eventually start bouncing around. This suggests the models are hitting a ceiling inherent to the task, not a lack of compute budget.

What This Means for Hardware Verification

The HierSVA benchmark gives the hardware formal verification community something it badly needed: a repeatable, multi-axis measure of what LLMs can actually do on real RTL hierarchies. The numbers confirm that raw assertion generation is not the bottleneck - 82% proof rate is impressive. The bottleneck is fault detection and formal core coverage. An assertion that proves but doesn't cover is a waste of verification cycles.

HierSVA's dataset and pipeline are public. I expect every serious hardware verification team to run their own favorite LLM through this benchmark before trusting it on a tapeout-critical block. The 70.2% fault detection number is a target to beat.


Source: HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.