Netflix splits dynamisch breite Cassandra-Partitionen, um Tail-Latency von Sekunden zu Millisekunden zu verringern

Q: What is the significance of: Netflix splits dynamisch breite Cassandra-Partitionen, um Tail-Latency von Sekunden zu Millisekunden zu verringern?

Das TimeSeries-Team von Netflix baute eine asynchrone Pipeline auf, die breite Partitionen automatisch erkennt und sie pro ID aufteilt, wodurch die Leseschwanzlatenz von Sekunden auf etwa 200ms reduziert wird und die Lesezeit reduziert wird.

Netflix cut read latency for wide partitions from seconds to around 200ms by dynamically splitting partitions per ID — not by throwing more Cassandra nodes at the problem. Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk, and Kartik Sathyanarayanan detailed the architecture on the Netflix Tech Blog, and it's worth a read if you run time-series workloads on any key-value store.

Wide Partitions Were Killing Tail Latency

Netflix's TimeSeries Abstraction ingests petabytes of temporal event data with millisecond latency, backed by Cassandra 4.x. Most reads hit single-digit milliseconds. But a small percentage of time-series IDs — think a popular profile ID or a noisy sensor — accumulate so many events per partition that the p99 read latency spikes to seconds. Read timeouts climb, GC pauses appear, CPU cranks, thread queues back up.

Scaling up Cassandra helps, but that's expensive. The team needed smarter automation.

Two Solutions: Time Slice Re-Partitioning and Per-ID Splitting

First approach: a background worker monitors Cassandra's nodetool tablehistograms and adjusts the time-bucket interval for future time slices. If partitions are under 2 MiB or over 10 MiB, the worker recomputes the ideal bucket size and applies it to new slices. This cut tail latency and thread queueing — but it only works when the whole dataset drifts. It can't fix a handful of hot IDs.

For those hot IDs, the team built a full async pipeline that detects, plans, splits, and serves partitions on a per-ID basis. Crucially, they operate only on immutable partitions (no longer receiving writes) to reduce complexity.

Dynamic Partitioning Pipeline: Detect, Plan, Split, Serve

Detection happens on the read path. Every read tracks bytes read per partition. Exceed a threshold? An event hits Kafka. The planner then reads the entire partition from scratch (with checkpointing) and computes a split strategy — usually assigning more event buckets to the same time bucket. Pre- and post-split checksums must match before the split is marked completed. The original wide partition is never deleted, providing a safe fallback.

Serving reads uses in-memory Bloom filters. Each TimeSeries server loads the partition keys of completed splits into a filter. On every read, the filter check takes single-digit microseconds. On a hit, the server looks up metadata (with a read-through cache) and reads N smaller partitions in parallel instead of one giant one. The latency penalty is invisible to callers.

Results: Sub-200ms Tail Latency and Stable Clusters

After the pipeline went live, average read latency for previously wide partitions dropped from seconds to low double-digit milliseconds. Tail latency settled around 200ms. Read timeouts nearly vanished. Cassandra CPU utilization flattened, thread queueing hit zero. Even a 500MB+ partition could be paginated and served in ~41 seconds — unavailable before — while the cluster stayed stable.

The team validated correctness with offline Spark jobs comparing split data to original, and rolled out through phases: shadow mode, comparison mode, then full production. The blog post includes real charts, thresholds, and code snippets.

More work is planned — splitting mutable partitions and reprocessing failed splits — but this already proves you don't need a new storage engine to fix hot partitions. A careful async pipeline with Bloom filters and checksum validation buys back stability without burning hardware.

Source: Dynamic Repartitioning for Time Series Workloads
Domain: netflixtechblog.com

Netflix splits dynamisch breite Cassandra-Partitionen, um Tail-Latency von Sekunden zu Millisekunden zu verringern

More in Systems Engineering