Five Flaws in Microservice Anomaly Detection, and Three Fixes

BARO, EventADL, and TORAI tackle the chronic disconnect between anomaly detection and root cause analysis, while RCAEval provides a sorely needed benchmark.

Most anomaly detection and root cause analysis (RCA) work for microservices suffers from five chronic limitations. A new thesis lays them out plainly, then proposes named methods that address them.

The Five Limitations That Plague Current Approaches

First, anomaly detection and RCA are typically treated as separate stages, with RCA assuming that anomalies have already been detected perfectly. That assumption breaks under real noise and detection delay. Second, the field concentrates on metrics, logs, and traces while ignoring event data such as API calls and configuration changes. Third, many methods require a pre-built service call graph, so they fail when one isn’t available. Fourth, there is no standard dataset or evaluation framework, which makes apples-to-apples comparisons impossible. Fifth, causal inference-based RCA has become the dominant approach, but its actual effectiveness, efficiency, and robustness remain unclear.

BARO and EventADL: End-to-End for Metrics and Events

BARO is an end-to-end approach that takes metric data and performs both anomaly detection and RCA in a single pipeline rather than in two separate stages. EventADL does the same for event data, a modality the field has largely neglected. Both were validated on real microservice systems, though the abstract doesn’t cite exact accuracy numbers.
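
To make the end-to-end idea concrete, here is a minimal sketch of a single pass that first finds when metrics start to deviate and then ranks metrics by how far they moved. It is not BARO's actual algorithm, which this summary does not describe. The trailing-window change detector, the median/IQR scoring, every function name, and the sample metrics are assumptions chosen purely for illustration.

```python
from statistics import median


def iqr(values):
    """Interquartile range using simple index-based quartiles."""
    s = sorted(values)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]


def detect_change(series, window=5, threshold=3.0):
    """Return the first index whose value deviates from the trailing
    window's median by more than threshold * IQR, or None."""
    for t in range(window, len(series)):
        ref = series[t - window:t]
        spread = iqr(ref) or 1e-9  # guard against flat reference windows
        if abs(series[t] - median(ref)) / spread > threshold:
            return t
    return None


def robust_score(normal, anomalous):
    """Largest deviation of the anomalous window from the normal median,
    scaled by the normal window's IQR."""
    med = median(normal)
    spread = iqr(normal) or 1e-9
    return max(abs(x - med) for x in anomalous) / spread


def detect_and_rank(metrics, window=5, threshold=3.0):
    """metrics: dict of name -> equally long list of floats.
    Returns (onset index or None, metric names ranked most suspicious first)."""
    candidates = (detect_change(s, window, threshold) for s in metrics.values())
    onsets = [t for t in candidates if t is not None]
    if not onsets:
        return None, []
    onset = min(onsets)  # earliest detected deviation across all metrics
    scores = {name: robust_score(s[:onset], s[onset:]) for name, s in metrics.items()}
    return onset, sorted(scores, key=scores.get, reverse=True)


# Hypothetical metric names and values, for illustration only.
sample_metrics = {
    "cart_cpu": [10, 11, 10, 12, 11, 10, 11, 50, 55, 60],
    "cart_latency": [100, 102, 101, 99, 100, 101, 100, 104, 103, 105],
    "db_mem": [30, 30, 31, 30, 31, 30, 31, 30, 31, 30],
}
onset, ranking = detect_and_rank(sample_metrics)
print(onset, ranking)
```

The point of the sketch is the coupling: the ranking step uses the onset produced by the detection step directly, so a late or noisy detection shifts the normal/anomalous split instead of being silently assumed correct.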

TORAI: Root Cause Without a Service Call Graph

TORAI is a multimodal RCA framework that does not require a service call graph as input. That’s a practical choice: in many production environments, the call graph is stale, incomplete, or nonexistent. TORAI fuses whatever observability signals are available and still localizes the root cause.
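
One way to picture graph-free, multimodal ranking is to score each service per modality, normalize the scores, and average over whichever modalities are present. The sketch below does exactly that and nothing more. It is an assumption-laden illustration, not TORAI's method, which the summary does not detail; the `fuse_scores` helper, the min-max normalization, and the service names and scores are all hypothetical.

```python
def fuse_scores(per_modality):
    """per_modality: dict of modality -> dict of service -> raw anomaly score.
    Each modality is min-max normalized across services, then each service
    gets the mean of its normalized scores over the modalities it appears in.
    Returns service names ranked most suspicious first. No call graph is used."""
    totals, counts = {}, {}
    for scores in per_modality.values():
        if not scores:
            continue  # a missing modality is simply skipped
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against identical scores
        for service, raw in scores.items():
            totals[service] = totals.get(service, 0.0) + (raw - lo) / span
            counts[service] = counts.get(service, 0) + 1
    fused = {s: totals[s] / counts[s] for s in totals}
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical per-modality anomaly scores; "db" has no trace data.
signals = {
    "metrics": {"cart": 9.0, "payment": 2.0, "db": 1.0},
    "logs": {"cart": 40.0, "payment": 35.0, "db": 5.0},
    "traces": {"cart": 0.8, "payment": 0.1},
}
suspects = fuse_scores(signals)
print(suspects)
```

Averaging only over available modalities keeps the ranking usable when a signal is missing, at the cost of letting a service seen in a single modality be judged on thinner evidence.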

RCAEval: The Benchmark the Field Needed

The thesis also introduces RCAEval, a benchmark that provides ready-to-use datasets and reproducible baselines. That alone could shift the field from ad-hoc evaluations to comparable results. The thesis also includes a systematic evaluation of existing causal inference-based RCA methods, with insights that should steer future work toward more robust designs.
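
Comparable results depend on a shared scoring rule. A common choice for RCA rankings is top-k accuracy (often written AC@k) and its average over k (Avg@k). The summary does not say which metrics RCAEval reports, so the metric choice here is an assumption, and the function names and sample rankings are hypothetical.

```python
def ac_at_k(rankings, truths, k):
    """Fraction of cases whose true root cause appears in the top k
    of the method's ranked candidate list."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


def avg_at_k(rankings, truths, k):
    """Mean of AC@1 through AC@k."""
    return sum(ac_at_k(rankings, truths, j) for j in range(1, k + 1)) / k


# Hypothetical output of one RCA method on three failure cases,
# each with "cart" as the injected root cause.
case_rankings = [
    ["cart", "db", "payment"],
    ["db", "cart", "payment"],
    ["payment", "db", "cart"],
]
case_truths = ["cart", "cart", "cart"]

print(ac_at_k(case_rankings, case_truths, 1))
print(avg_at_k(case_rankings, case_truths, 3))
```

Scoring every method with the same functions on the same cases is what turns separate experiments into a like-for-like comparison.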

These contributions don’t claim to solve every problem, but they close the biggest gap: treating anomaly detection and RCA as an integrated problem, not two disconnected chores. The next step is to see how BARO and TORAI hold up in live production environments with real incident response timelines.


Source: Anomaly Detection and Root Cause Analysis for Microservice Systems
Domain: arxiv.org
