Cloudflare's AI Pipeline Filters 20,799 Security Leads Down to 7,245 Verified Fixes

Of the 20,799 raw vulnerability candidates generated by Cloudflare's automated security harness, only 7,245 survived validation and deduplication as actionable bugs. That is a 65% cut in noise, and it illustrates why the orchestration around the LLMs matters more than the model itself.

From a 450-Line Skill to a Fleet-Wide Pipeline

Cloudflare started with a ~450-line security-audit skill that ran a 7-phase session on a single repo. Three parallel recon agents wrote an architecture.md, one Hunter ran per-class attacks, adversarial validators tried to disprove each finding, and a fresh agent re-verified survivors. It worked, but a single run found only about half of the bugs that multiple runs would catch.

Three walls appeared fast: context exhaustion after an hour, crashes that wiped hours of work, and zero cross-repo visibility. Cloudflare broke through by externalizing state into a SQLite database keyed by (run_id, repo, stage), making the LLM a stateless compute engine. Every stage writes immediately, so a crash costs only the task in flight. The mapping from skill to harness was nearly one-to-one: Recon, Hunt, Validate, Gapfill, Dedup, Trace, Feedback, and Report stages now run as a continuous producer-consumer loop over 128 repos.
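A minimal sketch of that externalized state, assuming a single table whose primary key is (run_id, repo, stage). The table name, column names, and helper functions are hypothetical, and the strictly sequential next_stage lookup is a simplification of the producer-consumer loop described above.

```python
import sqlite3

# Hypothetical schema: one row per (run_id, repo, stage). Names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_state (
    run_id TEXT NOT NULL,
    repo   TEXT NOT NULL,
    stage  TEXT NOT NULL,
    status TEXT NOT NULL,
    output TEXT,
    PRIMARY KEY (run_id, repo, stage)
)
"""

# Stage names taken from the article, in the order it lists them.
STAGES = ["recon", "hunt", "validate", "gapfill", "dedup", "trace", "feedback", "report"]


def open_store(path=":memory:"):
    """Open (or create) the state database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_stage(conn, run_id, repo, stage, output):
    """Persist a finished stage immediately, so a crash loses only the task in flight."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO stage_state VALUES (?, ?, ?, 'done', ?)",
            (run_id, repo, stage, output),
        )


def next_stage(conn, run_id, repo):
    """Return the first stage not yet marked done, or None when the repo is finished."""
    rows = conn.execute(
        "SELECT stage FROM stage_state WHERE run_id = ? AND repo = ? AND status = 'done'",
        (run_id, repo),
    )
    done = {row[0] for row in rows}
    for stage in STAGES:
        if stage not in done:
            return stage
    return None
```

Because the worker reads its position from the database rather than from its own context, a restarted process can call next_stage and resume where the previous one stopped.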

Two-Stage Architecture: Discovery and Validation

The system splits into the Vulnerability Discovery Harness (VDH) and the Vulnerability Validation System (VVS). VDH uses one model for hunting; VVS uses a completely different model for judgment. Forcing Model B to evaluate Model A's output gives every finding an adversarial, independent check. No model can grade its own homework.
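The separation can be sketched as a guard that refuses to let the hunting model judge its own findings. Everything here is illustrative: the Model wrapper, the prompt text, and the "confirmed" verdict string are assumptions, and the callables stand in for real LLM clients.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Model:
    """Hypothetical wrapper: a model name plus a callable standing in for an LLM client."""
    name: str
    call: Callable[[str], str]


def validate_findings(findings: List[str], hunter: Model, validator: Model) -> List[str]:
    """Keep only the findings that a different model fails to disprove."""
    if hunter.name == validator.name:
        raise ValueError("validator must be a different model from the hunter")
    confirmed = []
    for finding in findings:
        # Assumed prompt and verdict format, for illustration only.
        verdict = validator.call(f"Try to disprove this finding: {finding}")
        if verdict.strip().lower() == "confirmed":
            confirmed.append(finding)
    return confirmed
```

Making the check a hard error rather than a convention means a misconfigured run fails loudly instead of quietly letting one model approve its own output.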

Dynamic threat modeling happens during Recon: the agent writes a custom taxonomy for each repo, inventing attack classes beyond the ten built-in ones (injection, memory corruption, timing side channels, etc.). Hunters don't just read code; they compile fragments inside a sandbox (using unshare) and crash binaries to prove exploitability. When a Hunter needs a tool it doesn't have, it writes to a central wishlist (25,472 writes across 128 repos). One example: "I need a FreeBSD VM to confirm this PoC end-to-end."
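As a rough illustration of the two bookkeeping pieces in this step, the sketch below merges repo-specific attack classes into the built-in list and tallies wishlist entries. Only the three built-in classes the article names are included; the function names, the (repo, request) record shape, and the aggregation are assumptions.

```python
from collections import Counter

# Only the three built-in classes named in the article; the other seven are not listed there.
BUILTIN_CLASSES = ["injection", "memory corruption", "timing side channels"]


def build_taxonomy(repo_specific):
    """Built-in classes first, then repo-specific classes not already present."""
    seen = {c.lower() for c in BUILTIN_CLASSES}
    taxonomy = list(BUILTIN_CLASSES)
    for cls in repo_specific:
        name = cls.strip()
        key = name.lower()
        if key and key not in seen:  # case-insensitive de-duplication
            seen.add(key)
            taxonomy.append(name)
    return taxonomy


def top_wishes(wishlist, n=3):
    """wishlist: iterable of (repo, request) pairs. Returns the n most requested items."""
    return Counter(request for _, request in wishlist).most_common(n)
```

Counting requests across repos is one way a central wishlist could show which missing tool blocks the most Hunters.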

Deduplication scales as O(N^2) if an LLM naively compares every finding against every other. Cloudflare's deterministic code instead builds inverted indexes over files, functions, trust boundaries, and rare tokens to generate a short candidate list. Only then does a Dedup agent reason over that list. Stable cross-run keys keep the same finding from being spawned again on later runs. So far, 5,442 findings have been folded as duplicates.
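The candidate-generation idea can be sketched as follows, assuming each finding has already been reduced to a set of feature strings such as a file, a function, or a rare token. The feature format, the thresholds, and the key recipe are assumptions, not Cloudflare's actual scheme.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def candidate_pairs(findings, min_shared=2, max_postings=50):
    """findings: dict of finding id -> set of feature strings.

    Returns sorted pairs of ids that share at least min_shared features.
    Features attached to more than max_postings findings are skipped as too common.
    """
    index = defaultdict(set)  # inverted index: feature -> ids that carry it
    for fid, features in findings.items():
        for feat in features:
            index[feat].add(fid)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared features
    for ids in index.values():
        if len(ids) > max_postings:
            continue
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


def stable_key(repo, file, function, vuln_class):
    """Hypothetical cross-run key: a hash of normalized location and class."""
    raw = "\x1f".join(s.strip().lower() for s in (repo, file, function, vuln_class))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Only the pairs returned by candidate_pairs would be handed to the Dedup agent, so the expensive reasoning step sees a short list instead of every possible pair.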

Costs, Metrics, and Real-World Impact

The harness budgets each repo with a strict task cap and draws on a pool of 50-200 workers. A full scan of a complex repo (~30k lines) takes 3-4 hours and produces ~100 raw candidates; dedup and judgment then take 3 more hours and compress them to ~80 distinct bugs. The automated Fixer processes each bug in 5 minutes, writing a functional patch and regression test. Total time from discovery to pull request: about 14 hours. Critical flaws are fast-tracked to human review and patched in production within 5 days; lower-urgency bugs roll out over 15-20 days.
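One way to see how these figures add up is the back-of-the-envelope calculation below. It assumes the Fixer works through the bugs one at a time, which the article does not state; the function name is hypothetical.

```python
def pipeline_hours(scan_hours, triage_hours, bugs, fixer_minutes_per_bug):
    """Total wall-clock hours, assuming the Fixer handles bugs serially (an assumption)."""
    return scan_hours + triage_hours + bugs * fixer_minutes_per_bug / 60


low = pipeline_hours(3, 3, 80, 5)   # 3-hour scan
high = pipeline_hours(4, 3, 80, 5)  # 4-hour scan
print(f"{low:.1f} to {high:.1f} hours")
```

Under that serial assumption the total lands between roughly 12.7 and 13.7 hours, in line with the quoted figure of about 14 hours. Fixing 80 bugs at 5 minutes each accounts for nearly half of it.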

Cloudflare is releasing the initial skill on GitHub (github.com/cloudflare/security-audit-skill) as a starting point for anyone building their own harness. The orchestration layer, not the model, is what lasts.

Source: Build your own vulnerability harness
Domain: blog.cloudflare.com
