Out of 20,799 raw vulnerability candidates generated by Cloudflare's automated security harness, only 7,245 survived validation and deduplication as actionable bugs. That 65% noise cut is the argument for why orchestrating LLMs matters more than the model itself.
From a 450-Line Skill to a Fleet-Wide Pipeline
Cloudflare started with a ~450-line security-audit skill that ran a 7-phase session on a single repo. Three parallel recon agents wrote an architecture.md, one Hunter ran per attack class, adversarial validators tried to disprove each finding, and a fresh agent re-verified the survivors. It worked, but a single run found only about half the bugs that multiple runs would catch.
Three walls appeared fast: context exhaustion after an hour, crashes that wiped hours of work, and zero cross-repo visibility. Cloudflare broke through by externalizing state into a SQLite database keyed by (run_id, repo, stage), making the LLM a stateless compute engine. Every stage writes immediately, so a crash costs only the task in flight. The mapping from skill to harness was nearly one-to-one: Recon, Hunt, Validate, Gapfill, Dedup, Trace, Feedback, and Report stages now run as a continuous producer-consumer loop over 128 repos.
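The externalized-state idea can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not Cloudflare's schema: the source only states the (run_id, repo, stage) key, so the table name, the status and output columns, and the helper names below are all assumptions.

```python
import sqlite3

# Hypothetical schema: only the (run_id, repo, stage) key comes from the source.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_state (
    run_id TEXT NOT NULL,
    repo   TEXT NOT NULL,
    stage  TEXT NOT NULL,
    status TEXT NOT NULL,
    output TEXT,
    PRIMARY KEY (run_id, repo, stage)
)
"""

# Stage names as listed in the text, lowercased for use as keys.
STAGES = ["recon", "hunt", "validate", "gapfill",
          "dedup", "trace", "feedback", "report"]


def open_store(path=":memory:"):
    """Open (or create) the state database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def record_stage(conn, run_id, repo, stage, output):
    """Persist a finished stage immediately, so a crash loses only the task in flight."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO stage_state VALUES (?, ?, ?, 'done', ?)",
            (run_id, repo, stage, output),
        )


def pending_stages(conn, run_id, repo):
    """Return the stages a restarted worker still has to run, in order."""
    rows = conn.execute(
        "SELECT stage FROM stage_state "
        "WHERE run_id = ? AND repo = ? AND status = 'done'",
        (run_id, repo),
    )
    done = {row[0] for row in rows}
    return [stage for stage in STAGES if stage not in done]


conn = open_store()
record_stage(conn, "run-1", "example/repo", "recon", "architecture.md written")
print(pending_stages(conn, "run-1", "example/repo"))
```

Because the worker asks the database what is left rather than remembering it, the LLM session itself can be discarded and restarted at any point.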
Two-Stage Architecture: Discovery and Validation
The system splits into the Vulnerability Discovery Harness (VDH) and the Vulnerability Validation System (VVS). VDH uses one model for hunting; VVS uses a completely different model for judgment. Forcing Model B to evaluate Model A's output gives every finding an adversarial, independent check: no model grades its own homework.
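The hunter/judge separation can be enforced structurally rather than by convention. The sketch below is hypothetical throughout: the Model wrapper, build_pipeline, and the stub callables stand in for whatever model clients a real harness would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Model:
    """Hypothetical wrapper: a model name plus a callable that queries it."""
    name: str
    call: Callable[[str], str]


def build_pipeline(hunter: Model, judge: Model):
    """Return a function that hunts with one model and judges with another."""
    if hunter.name == judge.name:
        # Refuse to let a model grade its own homework.
        raise ValueError("hunter and judge must be different models")

    def run(code: str) -> dict:
        finding = hunter.call(code)
        verdict = judge.call(finding)
        return {
            "finding": finding,
            "verdict": verdict,
            "hunter": hunter.name,
            "judge": judge.name,
        }

    return run


# Stub models for illustration only; real ones would call an LLM API.
model_a = Model("model-a", lambda code: f"possible overflow in: {code}")
model_b = Model("model-b", lambda finding: "needs-proof")

pipeline = build_pipeline(model_a, model_b)
print(pipeline("memcpy(dst, src, len)"))
```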
Dynamic threat modeling happens during Recon: the agent writes a custom taxonomy for each repo, inventing attack classes beyond the ten built-in ones (injection, memory corruption, timing side channels, etc.). Hunters don't just read code; they compile fragments inside a sandbox (using unshare) and crash binaries to prove exploitability. When a Hunter needs a tool it doesn't have, it writes to a central wishlist (25,472 writes across 128 repos). One example: "I need a FreeBSD VM to confirm this PoC end-to-end."
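As a rough illustration of the sandboxing step, the sketch below builds an unshare command line that isolates a proof-of-concept binary in fresh user, PID, mount, and (by default) network namespaces. The source says only that unshare is used; the specific flag set and the sandbox_argv helper are assumptions, and whether unprivileged user namespaces are allowed depends on the host.

```python
def sandbox_argv(command, allow_network=False):
    """Build an argv that runs `command` under util-linux unshare.

    Assumed flag set, not Cloudflare's actual invocation:
      --user --map-root-user   new user namespace, caller mapped to root inside it
      --pid --fork             new PID namespace, command runs as a forked child
      --mount-proc             fresh /proc matching the new PID namespace
      --net                    new, empty network namespace (no connectivity)
    """
    argv = ["unshare", "--user", "--map-root-user",
            "--pid", "--fork", "--mount-proc"]
    if not allow_network:
        argv.append("--net")
    # "--" ends unshare's own options; everything after is the command to run.
    return argv + ["--"] + list(command)


print(sandbox_argv(["./poc", "crash_input.bin"]))
```

The resulting list can be passed to subprocess.run with a timeout on a Linux host; how a crash of the sandboxed binary surfaces in the exit status should be checked against the installed unshare version.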
Deduplication costs O(N^2) pairwise comparisons if you use an LLM naively. Cloudflare's deterministic code instead builds inverted indexes over files, functions, trust boundaries, and rare tokens to generate a short candidate list. Only then does a Dedup agent reason over that list. Stable cross-run keys keep a bug that was already recorded from being spawned again on a later run. So far, 5,442 findings have been folded as duplicates.
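The candidate-generation step might look like the following sketch. The field names (file, function, trust_boundary, rare_tokens, vuln_class), the min_shared threshold, and the key recipe are illustrative assumptions; the source describes the index dimensions but not the data model.

```python
import hashlib
from collections import defaultdict


def features(finding):
    """Flatten a finding into (kind, value) pairs to index on."""
    feats = set()
    for kind in ("file", "function", "trust_boundary"):
        if finding.get(kind):
            feats.add((kind, finding[kind]))
    for token in finding.get("rare_tokens", ()):
        feats.add(("token", token))
    return feats


def candidate_pairs(findings, min_shared=2):
    """Return pairs of finding IDs sharing at least `min_shared` features.

    Only findings that land in the same posting list are ever compared,
    so the expensive LLM step sees a short list instead of all N^2 pairs.
    """
    index = defaultdict(set)  # feature -> IDs of findings that have it
    for fid, finding in findings.items():
        for feat in features(finding):
            index[feat].add(fid)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared features
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1
    return sorted(pair for pair, n in shared.items() if n >= min_shared)


def stable_key(finding):
    """Hypothetical cross-run key: a hash of fields that should not change between runs."""
    parts = [finding.get("file", ""), finding.get("function", ""),
             finding.get("vuln_class", "")]
    return hashlib.sha256("\x1f".join(parts).encode()).hexdigest()[:16]


findings = {
    "F1": {"file": "src/parse.c", "function": "read_header",
           "rare_tokens": ["hdr_len"]},
    "F2": {"file": "src/parse.c", "function": "read_header",
           "rare_tokens": ["hdr_len", "memcpy_unchecked"]},
    "F3": {"file": "src/auth.c", "function": "check_token",
           "rare_tokens": ["hmac_cmp"]},
}
print(candidate_pairs(findings))  # [('F1', 'F2')]
```

Comparison inside a posting list is still pairwise, which is why indexing on rare tokens rather than common ones matters: it keeps each list small.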
Costs, Metrics, and Real-World Impact
The harness budgets each repo with a strict task cap and draws on a worker pool of 50-200 workers. A full scan of a complex repo (~30k lines) takes 3-4 hours to produce ~100 raw candidates, then 3 more hours for dedup and judgment, which compress them to ~80 distinct bugs. The automated Fixer processes each bug in 5 minutes, writing a functional patch and a regression test. Total time from discovery to pull request: about 14 hours. Critical flaws are fast-tracked to human review and patched in production within 5 days; lower-urgency bugs roll out over 15-20 days.
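One way the quoted figures reconcile with the roughly 14-hour total is to assume the Fixer handles the ~80 bugs one after another. That serial assumption is ours, not the source's; the arithmetic below only shows that it lands close to the stated number.

```python
def pipeline_hours(scan_hours, judge_hours, bugs, fix_minutes_per_bug):
    """Total wall-clock hours, assuming fixes run serially (an assumption)."""
    fix_hours = bugs * fix_minutes_per_bug / 60
    return scan_hours + judge_hours + fix_hours


# Figures from the text: 3-4 h scan, 3 h dedup and judgment, ~80 bugs, 5 min per fix.
low = pipeline_hours(3, 3, 80, 5)
high = pipeline_hours(4, 3, 80, 5)
print(f"{low:.1f} to {high:.1f} hours")  # 12.7 to 13.7 hours
```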
Cloudflare is releasing the initial skill on GitHub (github.com/cloudflare/security-audit-skill) as a starting point for anyone building their own harness. The orchestration layer, not the model, is what lasts.
Source: Build your own vulnerability harness
Domain: blog.cloudflare.com