How Cloudflare Achieved 10x Security Scanning Throughput Without New Partitions

Our scanning system was doing 10 scans per second and choking on its own backlog. After the rework, it sustains over 120 scans per second at peak, more than a 10x increase, without adding Kafka partitions or API pods.

The Bottleneck Wasn't Where We Expected

Cloudflare’s Security Insights system scans every account, zone, and DNS record for misconfigurations. It had two problems: scans ran only every week or two, and free accounts were scanned only if they opted in. To fix both, we needed 10x throughput, from 10 to 100 scans per second. The existing system was already failing: millions of events backlogged, API timeouts, crashing processes.

We started with Kafka. The partition count caps parallelism: within a consumer group, each partition is read by at most one consumer. Adding partitions would tax the shared broker, so we looked elsewhere. The immediate fix was batching. Each checker now consumes messages in batches and processes each message in its own goroutine. The trade-offs were acceptable: slightly higher memory use, and some work is redone after a crash. That got us partway.
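The batching step can be sketched roughly as follows. This is a minimal illustration, not the production checker: Message, processBatch, the scan callback, and the maxWorkers limit are all hypothetical names, and the Kafka client is left out entirely. The point it shows is that the function returns only after every goroutine has finished, so offsets for the batch can be committed afterwards; a crash mid-batch means the whole batch is redone.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical stand-in for one Kafka record.
type Message struct {
	Offset    int64
	AccountID string
}

// processBatch runs scan on every message concurrently, with at most
// maxWorkers goroutines in flight (maxWorkers must be at least 1).
// It returns one error slot per message, in batch order, and only
// returns once all goroutines are done.
func processBatch(batch []Message, maxWorkers int, scan func(Message) error) []error {
	errs := make([]error, len(batch))
	sem := make(chan struct{}, maxWorkers) // bounds concurrency
	var wg sync.WaitGroup

	for i, msg := range batch {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers scans are running
		go func(i int, msg Message) {
			defer wg.Done()
			defer func() { <-sem }()
			errs[i] = scan(msg) // each goroutine writes its own slot
		}(i, msg)
	}

	wg.Wait()
	return errs
}

func main() {
	batch := []Message{
		{Offset: 1, AccountID: "a"},
		{Offset: 2, AccountID: "b"},
		{Offset: 3, AccountID: "c"},
	}
	errs := processBatch(batch, 2, func(m Message) error {
		if m.AccountID == "b" {
			return fmt.Errorf("scan failed for account %s", m.AccountID)
		}
		return nil
	})
	for i, err := range errs {
		fmt.Printf("offset %d: err=%v\n", batch[i].Offset, err)
	}
}
```

Bounding the number of goroutines is one way to keep the extra memory cost of batching predictable; the source does not say whether the real checkers apply such a limit.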

Two Lanes, No Waiting: Parallelism and Head-of-Line Blocking

Some accounts have far more assets than others, turning a millisecond scan into a multi-minute slog. That blocked the consumer from moving to the next message. We split each checker into two consumer groups: a fast lane and a slow lane. The fast lane skips messages that look heavy; the slow lane handles them with dedicated resources. Head-of-line blocking vanished.
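A rough sketch of the lane split, under stated assumptions: both consumer groups read the same topic, each message carries some signal that predicts a heavy scan (here a hypothetical AssetCount field), and the cutoff heavyThreshold is an illustrative number, not one from the source. The slow lane skipping light messages is also an assumption; the source only says the fast lane skips heavy ones.

```go
package main

import "fmt"

// ScanRequest is a hypothetical message shape. AssetCount stands in for
// whatever signal is used to guess that a scan will be slow.
type ScanRequest struct {
	AccountID  string
	AssetCount int
}

// Lane identifies which consumer group a checker instance belongs to.
type Lane int

const (
	FastLane Lane = iota
	SlowLane
)

// heavyThreshold is an illustrative cutoff, not a value from the source.
const heavyThreshold = 10000

func isHeavy(r ScanRequest) bool {
	return r.AssetCount >= heavyThreshold
}

// shouldProcess reports whether a consumer in the given lane handles the
// request. Each lane skips the messages that belong to the other lane, so
// a heavy scan never sits in front of light ones in the fast lane.
func shouldProcess(lane Lane, r ScanRequest) bool {
	if lane == SlowLane {
		return isHeavy(r)
	}
	return !isHeavy(r)
}

func main() {
	requests := []ScanRequest{
		{AccountID: "small", AssetCount: 12},
		{AccountID: "huge", AssetCount: 250000},
	}
	for _, r := range requests {
		fmt.Printf("%s: fast=%v slow=%v\n",
			r.AccountID, shouldProcess(FastLane, r), shouldProcess(SlowLane, r))
	}
}
```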

Database writes were the next drag. Checkers sent their insights to a single API endpoint, which looped over them and ran one INSERT ... ON CONFLICT DO UPDATE per insight. With up to 500,000 insights per call, that is up to half a million round trips to Postgres. The classic alternative, COPY into a temp table, caused system table bloat. We landed on a hybrid: UNNEST for small batches, COPY for large ones. Inserts that took minutes now complete in seconds.
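The small-batch half of that hybrid might look something like the sketch below. The table name insights, its columns, the conflict key, and the copyThreshold cutoff are all assumptions made for illustration; the source gives neither the schema nor the batch size at which the API switches to COPY. The sketch only builds the statement and its column arrays, without a database driver.

```go
package main

import "fmt"

// Insight is a hypothetical row shape; the real schema is not in the source.
type Insight struct {
	ZoneID string
	Kind   string
	Detail string
}

// Strategy names the write path chosen for a batch.
type Strategy string

const (
	UseUnnest Strategy = "unnest"
	UseCopy   Strategy = "copy"
)

// copyThreshold is an assumed cutoff between "small" and "large" batches.
const copyThreshold = 1000

// unnestUpsert writes a whole batch in one round trip: each parameter is an
// array holding one column, and UNNEST zips the arrays back into rows.
const unnestUpsert = `
INSERT INTO insights (zone_id, kind, detail)
SELECT * FROM UNNEST($1::text[], $2::text[], $3::text[])
ON CONFLICT (zone_id, kind) DO UPDATE SET detail = EXCLUDED.detail`

// chooseStrategy picks UNNEST for small batches and COPY for large ones.
func chooseStrategy(batchSize int) Strategy {
	if batchSize >= copyThreshold {
		return UseCopy
	}
	return UseUnnest
}

// columns pivots rows into one slice per column, the shape the UNNEST
// statement expects for its three array parameters.
func columns(rows []Insight) (zoneIDs, kinds, details []string) {
	for _, r := range rows {
		zoneIDs = append(zoneIDs, r.ZoneID)
		kinds = append(kinds, r.Kind)
		details = append(details, r.Detail)
	}
	return zoneIDs, kinds, details
}

func main() {
	rows := []Insight{
		{ZoneID: "z1", Kind: "dns", Detail: "dangling CNAME"},
		{ZoneID: "z2", Kind: "tls", Detail: "weak cipher"},
	}
	zoneIDs, kinds, details := columns(rows)
	fmt.Println(chooseStrategy(len(rows)), len(zoneIDs), len(kinds), len(details))
	fmt.Println(unnestUpsert)
}
```

One caveat with this pattern: Postgres rejects an INSERT ... ON CONFLICT DO UPDATE that touches the same row twice in one statement, so a batch would need to be deduplicated on the conflict key before it is sent.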

The 50ms Latency Tax That Crippled Throughput

We noticed something odd: API calls served from Amsterdam, which had to reach our primary database in Portland, averaged 3 seconds, versus 10 ms for calls served in Portland. The root cause was active-active load balancing. Half the checker processes were routed to Amsterdam, paid a 50 ms network round trip to the database, exhausted the connection pool, and hit client-side timeouts. Per-partition Kafka lag confirmed it: exactly 15 of 30 partitions were falling behind, the ones tied to Amsterdam-bound processes.

Switching the API to active-passive fixed it overnight. The active instance now lives in Portland, colocated with the database. Latency dropped, the connection pool freed up, and throughput recovered.
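A back-of-the-envelope calculation shows why slow calls exhaust a connection pool. It assumes each API call holds one database connection for its full duration, which the source does not state, and the pool size of 30 is a made-up number. Under that assumption, Little's law caps throughput at pool size divided by call duration.

```go
package main

import "fmt"

// maxCallsPerSecond applies Little's law: a pool of poolSize connections,
// each held for callSeconds per API call, can complete at most
// poolSize / callSeconds calls per second.
func maxCallsPerSecond(poolSize int, callSeconds float64) float64 {
	return float64(poolSize) / callSeconds
}

func main() {
	const poolSize = 30 // hypothetical pool size, not from the source

	// Call durations are the averages quoted in the text above.
	fmt.Printf("remote (3 s per call):  %.0f calls/s\n", maxCallsPerSecond(poolSize, 3.0))
	fmt.Printf("local (10 ms per call): %.0f calls/s\n", maxCallsPerSecond(poolSize, 0.010))
}
```

Whatever the real pool size, the ratio is what matters: a 3-second call ties up a connection 300 times longer than a 10 ms call.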

Scheduling Without the Spikes

The scheduler used fixed periodic intervals, causing massive spikes — hundreds of thousands of scans triggered within minutes. We untangled accounts from zones (each zone gets its own last_scheduled_at), randomized existing timestamps, and added adaptive rate limiting. The rate limit recalculates every 30 minutes based on total accounts and scan intervals, ensuring scans stay uniformly distributed even as the customer base grows.
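One way to read the adaptive rate limit is as a simple sum: if every entity in a group must be scanned once per interval, a uniform schedule needs count divided by interval scans per second, added up across groups. The sketch below assumes that formula; the source does not give the exact calculation, and the Tier type, the days helper, and the entity counts are hypothetical. The intervals in main are the per-plan intervals quoted later in this article.

```go
package main

import (
	"fmt"
	"time"
)

// Tier is a hypothetical grouping of scan targets that share an interval.
type Tier struct {
	Name     string
	Entities int           // made-up counts, for illustration only
	Interval time.Duration // how often each entity should be scanned
}

// days converts a whole number of days into a time.Duration.
func days(n int) time.Duration {
	return time.Duration(n) * 24 * time.Hour
}

// targetRate returns the scans per second needed to scan every entity once
// per interval when scans are spread uniformly. Tiers with a non-positive
// interval are ignored.
func targetRate(tiers []Tier) float64 {
	var rate float64
	for _, t := range tiers {
		if t.Interval <= 0 {
			continue
		}
		rate += float64(t.Entities) / t.Interval.Seconds()
	}
	return rate
}

func main() {
	tiers := []Tier{
		{Name: "free", Entities: 6000000, Interval: days(7)},
		{Name: "pro/business", Entities: 900000, Interval: days(3)},
		{Name: "enterprise", Entities: 300000, Interval: days(1)},
	}
	// Recomputing this periodically lets the limit follow customer growth.
	fmt.Printf("target rate: %.1f scans/s\n", targetRate(tiers))
}
```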

Today, Security Insights sustains over 120 scans per second during peak scheduling. Free accounts scan every 7 days, Pro/Business every 3, Enterprise daily. The system is stable enough to support granular on-demand scans — a feature we couldn't have built before.

Source: Scaling Security Insights: how we achieved a 10x increase in global scanning capacity
Domain: blog.cloudflare.com