Twelve LLMs, when asked to write SystemVerilog Assertions for hierarchical RTL modules, produce assertions that prove non-vacuously 82.1% of the time - yet those same assertion sets catch only 70.2% of injected faults and cover a mere 36.2% of the formal core. That gap between "proves" and "catches bugs" is the kind of metric that separates a demo from a production verification flow.
What 342 Modules and 12 LLMs Actually Reveal
HierSVA, a new benchmark suite from anonymous authors (code on GitHub, dataset on Hugging Face), decomposes assertion quality into six axes: syntax correctness, proof success rate, vacuity, specification faithfulness, mutation coverage, and formal core coverage. They built a dataset of 342 modules from BaseJump STL with hierarchy depths 0-9, plus a deep subset of 28 module-bug pairs with natural-language specs and bug variants.
Testing twelve recent LLMs on the full set yields a module-level compile rate of 67.1%. Among assertions that compile and run, 82.1% prove non-vacuously. But the real story is in the downstream metrics: those generated assertion sets detect only 70.2% of eligible injected faults and cover 36.2% of the formal core. In the deep subset (211 evaluable model-module entries), assertion sets flag buggy RTL with 0.87 recall, but 40% of predicted-buggy outcomes are false positives on correct RTL, pinning precision at 0.60.
Agentic Mode: Gains That Fade
Switching LLMs to agentic mode - letting them iterate, call tools, and retry - improves S1-style provability and strength metrics. But the improvement plateaus and oscillates. You don't get a monotonic climb toward perfect assertions; you get diminishing returns that eventually start bouncing around. This suggests the models are hitting a ceiling inherent to the task, not a lack of compute budget.
What This Means for Hardware Verification
The HierSVA benchmark gives the hardware formal verification community something it badly needed: a repeatable, multi-axis measure of what LLMs can actually do on real RTL hierarchies. The numbers confirm that raw assertion generation is not the bottleneck - 82% proof rate is impressive. The bottleneck is fault detection and formal core coverage. An assertion that proves but doesn't cover is a waste of verification cycles.
HierSVA's dataset and pipeline are public. I expect every serious hardware verification team to run their own favorite LLM through this benchmark before trusting it on a tapeout-critical block. The 70.2% fault detection number is a target to beat.
Source: HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification
Domain: arxiv.org
Comments load interactively on the live page.