Source linked

Netflix、低遅延グラフエンジンで毎秒1000万回のオプションを処理

netflixtechblog.com@systems_wire6 days ago·Systems Engineering·14 comments

Netflix の新しい Graph Abstraction は、650 TB のデータを介して秒あたり約 10 万件の操作を処理し、ノードの読み取りと 1 ホップの横断のための 1 桁のミリ秒の遅延を提供します。

netflixgraph abstractionkey valueevcacheproperty graphhigh throughput

Netflix’s Graph Abstraction now handles almost 10 million operations per second across 650 TB of data, delivering sub‑millisecond latencies for node reads and 1‑hop traversals.

Architecture Overview

The engine sits on top of Netflix’s existing KV abstraction, using a property‑graph model to store nodes and edges with strongly typed properties. Each namespace maps to a KV table; node types live in separate namespaces, while edges are split into forward and reverse link indexes plus a separate property index. This separation lets edge links be upserted without pulling the entire property set, preventing wide rows in the underlying store.

Write‑aside caching of edge links keeps the write amplification low: a short‑lived cache avoids re‑writing a link that already exists. Read‑aside caching leverages EVCache; the system invalidates records on write for consistency‑critical graphs or uses TTL‑driven invalidation for high‑frequency updates. The KV layer guarantees idempotent writes via a timestamp‑based LWW policy.

Performance & Scaling

Edge and node persistence achieve single‑digit millisecond latencies (p99 shown in red, p90 in orange, p50 in green). 1‑hop traversals run in the same time window, while 2‑hop traversals—used by the Real‑Time Distributed Graph (RDG)—reach up to 100 ms but keep p90 under 50 ms. The Count API can perform high‑rate counting traversals with comparable latencies. Asynchronous node deletions finish in sub‑second time, and the system replicates data across regions for eventual consistency.

Consistency & Caching

Because writes touch multiple KV namespaces, the abstraction uses Kafka‑driven retries to repair entropy. Node deletions are asynchronous; LWW conflict resolution guarantees that concurrent updates don’t corrupt the graph. Global replication of both cache and durable storage yields an eventually consistent system, acceptable for the OLTP use cases that drive Netflix’s real‑time services.

Netflix will continue to refine the traversal planner and explore write‑through caching in future releases, but the current design already powers millions of operations per second while keeping latency low enough for real‑time user experiences.


Source: High-Throughput Graph Abstraction at Netflix: Part I
Domain: netflixtechblog.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.