LLM Proof Autoformalizers Are Brittle to Paraphrase and Counterfactual Edits

Seven out of seven LLM-powered proof autoformalizers for Lean 4 flunk a new robustness benchmark that tests two deceptively simple perturbations: paraphrasing the informal proof and making a counterfactual local edit.

Researchers at UC Riverside built this benchmark on top of miniF2F and MATH-500, two standard datasets for mathematical reasoning. They define global perturbation as rewriting the informal proof in a different style while preserving its logical content; a robust autoformalizer should output the same formal proof. Local perturbation alters a single value, symbol, or step, possibly in a counterfactual way; a robust model should faithfully encode that change, not ignore it or revert to the original.
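To make the two perturbation types concrete, here is a toy Lean 4 illustration. It is not drawn from the benchmark; the informal proofs and the theorem names `original` and `local_edit` are hypothetical, chosen only to show what a faithful formalization would look like in each case.

```lean
-- Informal proof (original): "Since x = 3, adding 2 gives x + 2 = 5."
-- A global paraphrase such as "Substitute 3 for x; the sum is 5." keeps the
-- logical content, so it should still formalize to this statement.
theorem original (x : Nat) (h : x = 3) : x + 2 = 5 := by
  subst h
  rfl

-- Informal proof (local counterfactual edit): "Since x = 4, adding 2 gives x + 2 = 6."
-- A faithful autoformalizer encodes the edited value. Emitting `original`
-- again would mean the edit was ignored.
theorem local_edit (x : Nat) (h : x = 4) : x + 2 = 6 := by
  subst h
  rfl
```

Both theorems type-check, which is the point: the formalization of a locally edited proof can be a perfectly valid Lean statement and still be unfaithful if it does not match the edited input.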

Global Paraphrase: Correctness Collapses Under Rewording

Under global perturbations, every model showed drops in correctness. The paper describes the seven autoformalizers as systems "all of which are sensitive to global perturbations." That is damning: when a proof is rephrased, say from conversational to terse, the formal output can no longer be relied on. A mathematician who writes an informal proof in their own style cannot count on the autoformalizer to produce consistent Lean 4 code.

Local Perturbations: Models Ignore Counterfactual Changes

Results under local perturbations are even worse. The autoformalizers "mostly fail to remain faithful": they largely ignore the injected counterfactual and produce the original formalization. This points to a deeper problem: the models appear to rely on surface-level pattern matching rather than reasoning about the proof's logical structure. A local change like swapping a hypothesis should produce a different formal proof; instead, these models tend to reproduce the old one.

What This Means for Formal Verification

The benchmark is automated: it measures correctness stability under global perturbations and faithfulness under local ones. Code and data are available on GitHub (ucr-rai/robust-proof-autoformalization). The abstract does not name the seven models tested; they plausibly include GPT-based, LLaMA-based, and specialized autoformalizers, but that is a guess. Whatever their architectures, they share the same fragility.
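As a rough sketch of what two such scores could look like, the Python below computes a stability rate and a faithfulness rate from per-problem outcomes. The definitions, the `ParaphraseTrial` class, and both function names are assumptions made for illustration; the paper's actual metrics may be defined differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ParaphraseTrial:
    """Outcome for one problem (hypothetical record, not the benchmark's schema)."""
    original_ok: bool    # formalization of the original proof was correct
    paraphrase_ok: bool  # formalization of the paraphrased proof was correct


def correctness_stability(trials: Sequence[ParaphraseTrial]) -> float:
    """Share of originally correct problems that stay correct after paraphrase."""
    solved = [t for t in trials if t.original_ok]
    if not solved:
        return 0.0
    return sum(t.paraphrase_ok for t in solved) / len(solved)


def faithfulness(edit_reflected: Sequence[bool]) -> float:
    """Share of local edits that show up in the formal output."""
    if not edit_reflected:
        return 0.0
    return sum(edit_reflected) / len(edit_reflected)


# Made-up outcomes, for illustration only.
trials = [
    ParaphraseTrial(original_ok=True, paraphrase_ok=True),
    ParaphraseTrial(original_ok=True, paraphrase_ok=False),
    ParaphraseTrial(original_ok=False, paraphrase_ok=False),
]
print(correctness_stability(trials))              # 0.5
print(faithfulness([True, False, False, False]))  # 0.25
```

Conditioning stability on the problems solved before perturbation is one design choice among several; it separates brittleness from baseline weakness, but an unconditioned accuracy difference would be an equally reasonable reading.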

Until LLM-based autoformalizers can handle both global rewording and local counterfactual edits with high reliability, they are not ready for real-world use, where informal proofs come in varied styles and may contain deliberate changes or accidental errors. The UC Riverside benchmark gives the community a clear target: pass these stress tests, then we can talk about deployment.

Source: Evaluating the Robustness of Proof Autoformalization in Lean 4
Domain: arxiv.org
