Netflix Builds a Real-Time Service Map to Cut Incident Time

netflixtechblog.com · @systems_wire · 3 days ago · Systems Engineering · 14 comments

A live topology of 10,000 microservices lets engineers see dependencies instantly, cutting 3 a.m. outages from minutes to seconds.

netflix · service topology · ebpf · observability · microservices · graph database

Netflix monitors 10,000 microservices, and a single 3 a.m. outage can ripple across the entire stack.

Why a Unified Map Matters

At Netflix, a user pressing play triggers a cascade of calls: authentication, recommendation, encoding, playback, and more. When one node flares, the ripple can reach hundreds of downstream services. Engineers routinely ask: Which services depend on which? What’s the blast radius? Is the fault upstream or downstream? Traditional metrics, logs, and traces answer parts of the puzzle but never the whole picture. A live, unified topology gives instant context, turning a 30‑minute investigation into a 5‑minute one.

Three Sources of Truth

Netflix built Service Topology from three independent graphs, each answering a different question.

  1. eBPF Network Flows – Kernel‑level flow records capture every TCP/UDP connection. Coverage is 100% of traffic, regardless of instrumentation. The downside? No application context.
  2. IPC Metrics – Instrumented services emit gRPC, GraphQL, or REST call metrics. These reveal endpoint‑level details and error rates, but only for services that emit metrics.
  3. Distributed Tracing – Traces follow individual requests, showing real‑time call paths. Sampling limits visibility of rare flows, yet it exposes actual runtime behavior.

By storing each layer in a separate graph partition and merging them on demand, Netflix achieves sub‑second traversal even when combining all three layers.
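To make the merge-on-demand idea concrete, here is a minimal Python sketch. It is not Netflix's implementation: the layer names, service names, and the `merge_layers` helper are all hypothetical, and plain in-memory sets stand in for the graph partitions. It unions the edges of whichever layers a query asks for and records which layers observed each edge.

```python
from collections import defaultdict

# Hypothetical sample data: each layer is a set of (caller, callee) edges.
# The layer and service names are illustrative, not taken from the source.
LAYERS = {
    "ebpf": {("api", "auth"), ("api", "recs"), ("recs", "cache")},
    "ipc": {("api", "auth"), ("api", "recs")},
    "trace": {("api", "auth")},
}


def merge_layers(layers, wanted):
    """Union the edges of the requested layers.

    Returns a dict mapping each edge to the set of layer names that saw it.
    """
    merged = defaultdict(set)
    for name in wanted:
        for edge in layers[name]:
            merged[edge].add(name)
    return dict(merged)


merged = merge_layers(LAYERS, ["ebpf", "ipc", "trace"])

# Edges seen only at the network level have no application context yet.
network_only = sorted(e for e, seen in merged.items() if seen == {"ebpf"})
```

Keeping the provenance per edge is one way to surface the trade-offs listed above: an edge seen only by eBPF is real traffic without application context, while an edge seen by all three layers carries endpoint and runtime detail as well.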

Building a Living Graph

Flow logs arrive from multi‑region Kafka and are processed by Apache Pekko Streams. A three‑stage aggregation pipeline resolves intermediaries (load balancers, NAT gateways) and reconstructs direct app‑to‑app edges. The final graph is persisted in Netflix’s custom graph database, built atop a distributed key‑value store for high‑throughput traversal.
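The intermediary-resolution step can be illustrated with a small Python sketch. This is a simplified stand-in, not the pipeline described in the post: it ignores Kafka and Pekko Streams entirely, the `resolve_direct_edges` function and all node names are hypothetical, and it pairs every source that reaches an intermediary with every destination behind it, which can over-approximate when a load balancer fronts several unrelated services.

```python
from collections import defaultdict


def resolve_direct_edges(flows, intermediaries):
    """Collapse hops through intermediaries into direct app-to-app edges.

    flows: iterable of (src, dst) pairs as seen on the network.
    intermediaries: set of node names that are not applications.
    """
    out = defaultdict(set)
    for src, dst in flows:
        out[src].add(dst)

    def endpoints(node, seen):
        """Follow hops through intermediaries until real applications are reached."""
        if node not in intermediaries:
            return {node}
        if node in seen:  # guard against intermediary loops
            return set()
        seen = seen | {node}
        found = set()
        for nxt in out.get(node, ()):
            found |= endpoints(nxt, seen)
        return found

    edges = set()
    for src, dst in flows:
        if src in intermediaries:
            continue  # only start from real applications
        for app in endpoints(dst, frozenset()):
            edges.add((src, app))
    return edges


# Hypothetical flow records: playback reaches licensing via a load balancer and a NAT gateway.
flows = [
    ("playback", "lb-1"),
    ("lb-1", "nat-1"),
    ("nat-1", "licensing"),
    ("playback", "auth"),
]
direct = resolve_direct_edges(flows, {"lb-1", "nat-1"})
```

A production pipeline would presumably correlate individual connections rather than whole nodes; the sketch only shows the shape of the problem.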

A gRPC API exposes the topology. Engineers can query multi‑hop paths, filter by availability tier or business domain, and paginate large result sets. Programmatic access powers automated blast‑radius calculators and incident‑response bots.
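A blast-radius calculator of the kind mentioned above can be approximated with a breadth-first traversal. The following Python sketch does not use the real gRPC API, whose methods and message types are not given in the source. It assumes edges point from caller to callee and treats the blast radius of a failed service as every service that calls it, directly or transitively, within an optional hop limit. The `blast_radius` function and the service names are hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(edges, failed, max_hops=None):
    """Return {service: hops} for every transitive caller of `failed`.

    edges: iterable of (caller, callee) pairs.
    max_hops: optional limit on traversal depth; None means unlimited.
    """
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)

    hops = {failed: 0}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        if max_hops is not None and hops[node] >= max_hops:
            continue  # do not expand beyond the hop limit
        for caller in callers.get(node, ()):
            if caller not in hops:
                hops[caller] = hops[node] + 1
                queue.append(caller)

    del hops[failed]  # the failed service is not part of its own blast radius
    return hops


# Hypothetical dependency edges (caller -> callee).
edges = [
    ("web", "api"),
    ("tv", "api"),
    ("api", "auth"),
    ("api", "recs"),
    ("recs", "cache"),
]
impacted = blast_radius(edges, "cache")
nearby = blast_radius(edges, "cache", max_hops=2)
```

Returning the hop count alongside each service lets an incident bot rank impacted services by distance from the fault. Filtering by availability tier or business domain would be an extra predicate on the nodes.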

The map updates in real time: new dependencies appear as traffic flows, stale edges fade when calls stop, and health status overlays reflect live service health. Engineers can also time‑travel, querying the graph at any past window to see how dependencies changed around an incident.
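One simple way to model fading edges and time-travel queries is to keep a first-seen and last-seen timestamp per edge and return the edges whose lifetime overlaps the requested window. The Python sketch below is an assumption about how such a query could work, not a description of Netflix's storage format; the `EdgeObservation` type, the `edges_active_between` function, and the integer timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeObservation:
    src: str
    dst: str
    first_seen: int  # hypothetical epoch seconds of the first observed call
    last_seen: int   # hypothetical epoch seconds of the most recent call


def edges_active_between(observations, start, end):
    """Return edges whose observed lifetime overlaps the window [start, end]."""
    return {
        (o.src, o.dst)
        for o in observations
        if o.first_seen <= end and o.last_seen >= start
    }


# Hypothetical history: one long-lived edge, one that went stale, one that appeared later.
history = [
    EdgeObservation("api", "auth", first_seen=0, last_seen=1000),
    EdgeObservation("api", "legacy", first_seen=0, last_seen=300),
    EdgeObservation("api", "recs", first_seen=600, last_seen=1000),
]
before = edges_active_between(history, 0, 100)
during = edges_active_between(history, 400, 500)
after = edges_active_between(history, 900, 1000)
```

Comparing the `before` and `after` sets shows which dependencies appeared or disappeared around an incident. A single interval per edge is a simplification, since an edge that stops and later resumes would need several intervals.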

What Engineers Get Today

  • Visual dependency graphs with health overlays.
  • Quick navigation to logs, traces, and metrics.
  • Programmatic blast‑radius queries.
  • Historical snapshots for root‑cause analysis.

By replacing a static diagram with a living, queryable knowledge graph, Netflix makes infrastructure plumbing a first‑class observability asset, keeping members watching with minimal interruption.


Future posts will dive into the engineering challenges of ingesting millions of flow records per second, handling Kafka lag, and debugging reactive streams at Netflix scale.


Source: From Silos to Service Topology: Why Netflix Built a Real-Time Service Map
Domain: netflixtechblog.com
