The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks; once cryptographic safety checks are applied, that number drops to 17.1%. In other words, for more than four out of five tasks the generated patch is either functionally wrong or cryptographically unsafe.
Why MPC Repairs Need More Than Functional Correctness
Secure Multi-Party Computation (MPC) is increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics. But existing LLM code-repair benchmarks like SWE-bench don't translate directly. MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic, and standard fail-to-pass evaluation is insufficient for code that must also be cryptographically safe. The gap is real: up to 40% of patches that pass all functional tests get rejected by the MPC Verifier for security or numerical-fidelity violations.
How MPC-Patch-Bench Curates and Verifies Patches
MPC-Patch-Bench builds its 205 fully verified instances on two components. The Data Curation Framework uses a domain-specific agent to filter raw pull requests through three cryptographic layers, then pairs that agent with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests. The MPC Verifier runs dynamic differential testing against plaintext oracles and applies MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. This goes beyond the usual unit-test pass/fail check: a security audit is built into the evaluation itself.
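To make "differential testing against a plaintext oracle" concrete, here is a minimal sketch under stated assumptions. It is not the benchmark's verifier: the two-party additive sharing scheme, the ring size, the fixed-point scale, the tolerance, and every function name below are hypothetical choices made for illustration. The idea is simply to run the candidate secure operation on random inputs, reconstruct the result, and compare it with the same computation done in the clear.

```python
import random
from typing import Callable, Tuple

MODULUS = 2 ** 64   # ring for additive shares (assumed for this sketch)
SCALE = 2 ** 16     # fixed-point scaling factor (assumed for this sketch)

Shares = Tuple[int, int]


def encode(x: float) -> int:
    """Map a real number to a fixed-point ring element."""
    return round(x * SCALE) % MODULUS


def decode(v: int) -> float:
    """Map a ring element back to a real number (upper half is negative)."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def share(v: int, rng: random.Random) -> Shares:
    """Split v into two additive shares that sum to v modulo MODULUS."""
    r = rng.randrange(MODULUS)
    return r, (v - r) % MODULUS


def reconstruct(s: Shares) -> int:
    """Open a shared value by summing both shares."""
    return (s[0] + s[1]) % MODULUS


def secure_add(a: Shares, b: Shares) -> Shares:
    """Correct protocol: each party adds its own shares locally."""
    return (a[0] + b[0]) % MODULUS, (a[1] + b[1]) % MODULUS


def buggy_add(a: Shares, b: Shares) -> Shares:
    """Hypothetical bad patch: party 1 forgets to add its share of b."""
    return (a[0] + b[0]) % MODULUS, a[1]


def differential_test(candidate: Callable[[Shares, Shares], Shares],
                      trials: int = 200, tol: float = 1e-3,
                      seed: int = 0) -> bool:
    """Compare the candidate against plaintext addition on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(-1000, 1000), rng.uniform(-1000, 1000)
        out = candidate(share(encode(a), rng), share(encode(b), rng))
        if abs(decode(reconstruct(out)) - (a + b)) > tol:
            return False  # secure result diverges from the plaintext oracle
    return True


print(differential_test(secure_add))  # True
print(differential_test(buggy_add))   # False
```

A check like this catches numerical divergence but says nothing about leakage, which is why the article describes static rules alongside it.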
The Numbers: Up to 40% of Functional Patches Rejected by the MPC Verifier
On functional tests alone, the best LLM resolves 22.9% of tasks. Once the MPC Verifier is applied, only 17.1% remain. That is a relative drop of about 25% in the effective resolution rate ((22.9 - 17.1) / 22.9 ≈ 0.25), caused by cryptographic and numerical violations alone. The benchmark's design shows that standard functional correctness metrics are blind to the very properties that make MPC code usable in production, such as preventing secret data from leaking through revealed intermediate values.
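As an illustration of the kind of leak a functional test cannot see, the following is a toy static check, not one of the MPC Verifier's actual rules. It assumes a hypothetical convention in which secret values are opened with a method named `reveal`, and it applies a deliberately crude rule: any `reveal()` call that is not part of a `return` expression is treated as an intermediate reveal and flagged. The sample patch and function names are invented for this sketch.

```python
import ast
from typing import List

# Hypothetical patch under review; `reveal` is an assumed method name.
SAMPLE_PATCH = '''
def secure_mean(xs):
    total = sum(xs)
    debug = total.reveal()
    return (total / len(xs)).reveal()
'''


def find_intermediate_reveals(source: str) -> List[int]:
    """Return line numbers of .reveal() calls outside return expressions."""
    tree = ast.parse(source)

    # Collect every node that sits inside the value of a return statement.
    allowed = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Return) and node.value is not None:
            for sub in ast.walk(node.value):
                allowed.add(id(sub))

    # Flag reveal() calls that are not part of a returned expression.
    flagged = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "reveal"
                and id(node) not in allowed):
            flagged.append(node.lineno)
    return sorted(flagged)


print(find_intermediate_reveals(SAMPLE_PATCH))  # [4]
```

The sample function still returns the correct mean, so a Fail-to-Pass test would accept it; only a rule that inspects where values are opened notices the extra reveal on line 4.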
MPC-Patch-Bench sets a new bar for LLM agents targeting privacy-preserving infrastructure; expect future evaluations to require both functional and cryptographic verification before any patch is trusted in a secure computation context.
Source: MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation
Domain: arxiv.org