Source linked

Как Netflix использует динамическое разделение, чтобы решить латентность широкого ряда в Кассандра

netflixtechblog.com@systems_wire2 hours ago·Systems Engineering·1 comments

Внедряя асинхронный трубопровод для обнаружения и разделения широких разделов на уровне ID, Netflix сократил задержки чтения хвоста с нескольких секунд до менее 200 мс.

netflixapache cassandratime seriessystems engineeringdistributed systems

Wide partitions in time-series datasets can drive read latencies from single-digit milliseconds into the realm of several seconds, causing catastrophic tail latency and thread queueing in Cassandra clusters.

The failure of static provisioning

Netflix's TimeSeries abstraction relies on Apache Cassandra 4.x to ingest petabytes of temporal data. While the system uses Monte Carlo simulations to provision optimal partition configurations during dataset creation, static provisioning fails when workloads evolve or data outliers emerge. A small percentage of TimeSeries IDs often receive a vastly higher volume of events than the rest, creating "wide rows" that bypass the intended partitioning strategy.

Initially, Netflix addressed this through Time Slice re-partitioning, which adjusts the partitioning strategy for entire tables based on observed partition density. However, this approach is ineffective when only a subset of IDs within a table are problematic. In such cases, the system must move beyond table-level adjustments to granular, ID-level intervention.

Automating ID-level partition splits

To handle individual outliers, Netflix built a dynamic partitioning pipeline that operates asynchronously across three stages: detection, planning/splitting, and serving reads.

Detection occurs on the read path. When a server detects that a partition exceeds a specific byte threshold, it emits a detection event to Kafka. This ensures that the system only spends resources on partitions that are actually causing performance issues. The pipeline focuses on immutable partitions to reduce complexity, though the architecture is designed to eventually support mutable data.

Once detected, a planner computes an optimal split strategy—such as assigning more event buckets to a single time bucket—and executes the split. To ensure data integrity, the system uses a strict checksum validation process: the planner stores a pre-split checksum, and the splitter must match it with a post-split checksum before the operation is marked as complete.

Transparently routing reads via Bloom filters

Serving reads from split partitions must be invisible to the caller to avoid introducing new latency overhead. Netflix achieves this by loading the partition keys of completed splits into in-memory Bloom filters on the TimeSeries servers.

Every read operation checks these Bloom filters, which typically incurs a latency penalty of only a few microseconds. If a hit occurs, the server transparently diverts the query to the new, smaller partitions. This diversion is backed by a read-through cache for metadata, ensuring that even when a split is detected, the performance impact remains minimal.

This dynamic approach has transformed the performance profile of Netflix's time-series workloads. Average read latencies for wide partitions have dropped from seconds to low double-digit milliseconds, and tail latencies have stabilized at 200ms or better, enabling the system to query even 500MB+ partitions while remaining highly available.


Source: Dynamic Repartitioning for Time Series Workloads
Domain: netflixtechblog.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.