Frontier large language models plateau at 90.8% on the VerilogEval benchmark, and more compute is not lifting that ceiling.
That number comes from a new paper that introduces a four-category error taxonomy for LLM-generated hardware designs: syntactic, semantic, solvable functional, and unsolvable functional errors. The last category is the killer. Models can fix syntax bugs with alignment techniques and patch solvable functional errors with repeated sampling, but unsolvable functional failures are baked into pretraining knowledge and resist all scaling tricks.
The 90.8% Ceiling Is a Knowledge Wall
VerilogEval measures how often an LLM produces a correct register-transfer level (RTL) implementation on the first try. Every frontier model tested hits the same 90.8% plateau. The remaining 9.2% is not a fluke: it consists entirely of unsolvable functional errors. Test-time compute scaling, chain-of-thought, and larger models all fail to crack it.
The paper argues that these errors stem from fundamental gaps in how LLMs reason about parallel temporal logic versus sequential programming. The models can translate syntax but cannot internalize the hardware concurrency model.
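A toy illustration of that gap, not taken from the paper: the sketch below models one clock edge for two registers that swap values. The register names `a` and `b` and both function names are hypothetical. With hardware-style semantics every register reads the pre-edge state and all registers update together; with software-style semantics the assignments run one after another, so the second assignment sees the result of the first.

```python
def clock_edge_parallel(state):
    """Hardware-style update: every next value is computed from the
    pre-edge state, then all registers change at once (similar in
    spirit to Verilog non-blocking assignments)."""
    return {"a": state["b"], "b": state["a"]}


def clock_edge_sequential(state):
    """Software-style update: statements execute in order, so the
    second assignment reads the value the first one just wrote."""
    new_state = dict(state)
    new_state["a"] = new_state["b"]
    new_state["b"] = new_state["a"]  # reads the already-overwritten a
    return new_state


start = {"a": 1, "b": 2}
print(clock_edge_parallel(start))    # {'a': 2, 'b': 1}  (values swapped)
print(clock_edge_sequential(start))  # {'a': 2, 'b': 2}  (old a is lost)
```

Both versions look the same line by line, yet only the parallel one swaps the registers. A model that reads RTL as a sequence of statements would plausibly make this kind of mistake.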
Alignment Makes Compilers, Not Engineers
The paper calls a nasty trade-off the "surface convergence gap": optimization eliminates syntax errors while making deep functional failures worse. Alignment techniques like reinforcement learning from human feedback teach models to produce compilable code, not correct hardware. The models learn to match surface patterns without understanding the underlying state machine.
Repeated sampling, which generates many candidates and selects the one that passes tests, can patch solvable functional errors. The unsolvable ones do not budge. The paper's conclusion is blunt: RTL coding capacity is strictly bounded by what the model learned during pretraining.
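Repeated sampling can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's evaluation harness: `best_of_n`, the stand-in generators, and the `is_correct` check are all hypothetical placeholders for a real LLM call and a real testbench. The `pass_at_k` function is the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); whether the paper reports its numbers with this exact estimator is an assumption.

```python
from itertools import cycle
from math import comb


def best_of_n(generate, passes_tests, n):
    """Draw up to n candidates and return the first one that passes
    the tests, or None if none of them do."""
    for _ in range(n):
        candidate = generate()
        if passes_tests(candidate):
            return candidate
    return None


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct:
    the probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical stand-ins for a model and a testbench.
def is_correct(candidate):
    return candidate == "good"

solvable_stream = cycle(["bad", "bad", "good"])  # correct 1 time in 3

def solvable():
    return next(solvable_stream)

def unsolvable():
    return "bad"  # the model never produces a correct design

print(best_of_n(solvable, is_correct, n=10))    # good
print(best_of_n(unsolvable, is_correct, n=10))  # None
print(pass_at_k(n=10, c=5, k=1))                # 0.5
print(pass_at_k(n=10, c=0, k=10))               # 0.0
```

The last line is the point of the distinction: when no sample is ever correct, pass@k stays at zero for every k, so a larger sampling budget cannot help.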
What This Means for Hardware Design Automation
For anyone building LLM-powered hardware generation pipelines, this paper is a cold dose of reality. Throwing more GPUs at inference or fine-tuning won't break through the 90.8% barrier. The bottleneck is not alignment or inference strategy; it is the model's inability to reason about hardware semantics.
Future work needs to focus on grounding LLMs in formal hardware verification and temporal reasoning, not on better prompt templates. The plateau is a signal that pretraining data alone cannot teach the parallel logic of RTL; the next step is to inject structured hardware knowledge into the model architecture itself.
Source: How LLMs Fail and Generalize in RTL Coding for Hardware Design?
Domain: arxiv.org