Source linked

Netflix Slashes Cassandra Read Latency From Seconds to Milliseconds

netflixtechblog.com@systems_wire2 hours ago·Systems Engineering·2 comments

After detecting wide partitions stalling reads for seconds, Netflix built an async pipeline that splits them per ID - dropping tail latency to ~200 ms and letting them query 500MB+ partitions without crashing.

netflixcassandratime seriesdistributed systemssystems engineering

Netflix engineers drove average read latency for wide Cassandra partitions from seconds down to low double-digit milliseconds — and they did it without throwing more hardware at the problem.

That improvement came from a dynamic repartitioning pipeline that splits oversized partitions at the individual ID level, not the table level. Before this, some datasets saw tail latencies in the seconds range, read timeouts spiking, and even GC pauses and thread queueing. The fix is now running in production.

Wide Partitions Were Stalling Reads

Netflix’s TimeSeries Abstraction stores petabytes of temporal event data in Cassandra 4.x. Each dataset divides data into time slices, time buckets, and event buckets. But some IDs accumulate vastly more events than others — a classic data skew problem. A small percentage of IDs can receive orders of magnitude more writes, bloating single partitions.

Standard Cassandra introspection (nodetool tablehistograms) shows partition sizes. When a partition grows beyond a configured density — typically between 2 MiB and 10 MiB — read amplification and thread queueing kick in. Scaling up the cluster is always an option, but Netflix wanted a smarter, cheaper approach.

Two-Stage Attack: Table-Level First, Then Per-ID

First, a background worker monitors Cassandra partition histograms and computes an adjustment factor. If partitions are too small (under-partitioning leads to read amplification) or too large, the worker updates future time slices with a new time_bucket interval. For example, changing from 60-second buckets to 604,800-second buckets. This fixed the common case but doesn't help when only a subset of IDs are wide.

For those outliers, Netflix built a three-stage async pipeline: detection, planning & splitting, and serving reads.

Detection happens on the read path. If bytes read exceed a threshold, the server emits a Kafka event with the time slice, time series ID, bucket identifiers, and an immutability flag. Splitting is deliberately limited to immutable partitions first — partitions that are no longer receiving writes. This reduces complexity.

The planner reads the entire partition once, computes a split plan, and stores checkpoints in a wide_row metadata table. Splitting delegates to strategies like EventBucketPartitionSplitStrategy, which assigns more event buckets to the same time bucket. Pre- and post-split checksums validate correctness. Splits are marked completed only if checksums match.

Serving reads uses in-memory Bloom filters. Each read checks whether the partition key is wide; the check takes single-digit microseconds. On a hit, the server looks up the split metadata and reads N smaller partitions in parallel. The original wide partition is never deleted, providing a safe fallback. Netflix reports that the filters fit comfortably per server instance.

Results: Latency Dropped, Stability Returned

Average latency for reading wide partitions dropped from seconds to low double-digit milliseconds. Tail latency fell from several seconds to around 200 ms. Read timeouts nearly vanished. CPU utilization on Cassandra nodes dropped, and thread queueing disappeared.

One extreme case: the service paginated and queried a 500MB+ partition in about 41 seconds while remaining fully available. Previously, that dataset faced constant timeouts and unavailability blips.

Netflix also built a phased rollout with shadow-mode comparison, verifying bytes served by old and new paths matched. A Spark job further validated splits offline.

Next steps include splitting mutable partitions and reprocessing failed splits. For now, this approach proves that smart dynamic repartitioning — not just vertical scaling — can tame even the widest time series skew.


Source: Dynamic Repartitioning for Time Series Workloads
Domain: netflixtechblog.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.