Dynamic Task Replication Corrects Silent Data Corruption at About 0.5% Overhead per Error

Replicating tasks with dynamic data dependencies to guard against silent data corruptions (SDCs) typically more than doubles running time. The ItoyoriFBC runtime instead reports failure-free running times that grow by less than a factor of two, plus a correction cost of about 0.5% per SDC.

Why Static Replication Falls Short for Dynamic Tasks

Most existing SDC protection schemes assume static tasks and fixed dependencies. That works for bulk-synchronous parallelism, but breaks down in Asynchronous Many-Task (AMT) runtimes, where tasks spawn at runtime, communicate through C++11-like promises/futures, touch futures conditionally (a touch waits for a future's value), and are balanced across a cluster by work stealing.

Tracking every input and output for comparison in a dynamic DAG is expensive. Naïve full replication doubles compute time and still leaves the problem of when to compare results and how to recover only the affected tasks.

Cross-Validation at the Runtime Boundary Keeps Overhead Low

The authors of the ItoyoriFBC runtime propose a tightly coupled approach: original and replica computations cross-validate all outgoing effects at the point where they interact with the runtime. Instead of comparing only final task results, the runtime validates every side effect (each future write, each task spawn) against the corresponding effect of the replica. This catches corruption early and enables selective recomputation of only the tasks that were actually affected by an SDC.

Only corrupted tasks get recomputed, not entire subtrees. Work stealing already present in the runtime naturally balances the extra replica tasks, turning a cost into an opportunity: more tasks means better load balancing.
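
The idea can be illustrated with a minimal Python sketch. Everything here is hypothetical and is not the ItoyoriFBC API: `fib_task`, `run_validated`, and the tuple encoding of effects are invented for illustration, an SDC is simulated by shifting one outgoing value, and a mismatch is handled by simply rerunning that one task pair, which may differ from the correction protocol in the paper.

```python
from typing import Callable, List, Tuple

# An effect is anything a task sends out through the runtime:
# ("write", future_name, value) or ("spawn", task_name, argument).
Effect = Tuple[str, str, int]


def fib_task(n: int, corrupt: bool = False) -> List[Effect]:
    """Toy task that returns its outgoing effects instead of performing them.

    `corrupt` simulates an SDC by shifting one outgoing value by 1.
    """
    bad = 1 if corrupt else 0
    if n < 2:
        return [("write", f"fib({n})", n + bad)]
    return [("spawn", "fib", n - 1 + bad), ("spawn", "fib", n - 2)]


def run_validated(task: Callable[[int, bool], List[Effect]], arg: int,
                  corrupt_first_attempt: bool = False) -> Tuple[List[Effect], int]:
    """Run original and replica, compare effects, retry only this task on mismatch."""
    attempts = 0
    while True:
        attempts += 1
        hit_by_sdc = corrupt_first_attempt and attempts == 1
        original = task(arg, hit_by_sdc)  # may carry the simulated corruption
        replica = task(arg, False)        # independent second execution
        if original == replica:
            # Effects agree, so the runtime may now commit them.
            return original, attempts
        # Mismatch: discard both effect lists and recompute this task only.


effects, attempts = run_validated(fib_task, 5, corrupt_first_attempt=True)
print(effects, attempts)
```

In this toy run the corrupted spawn argument is caught before any child task is created, so the only work repeated is the single task that produced the bad effect.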

Benchmarks: Fibonacci and H-Matrix LU Decomposition

Preliminary experiments using Fibonacci (a stress test for dynamic spawning) and an emulated $\mathcal{H}$-matrix LU decomposition (a more realistic linear algebra kernel on hierarchical matrices) quantify the overhead. Failure-free running times increased by less than a factor of two, despite full replication. The correction overhead per SDC was around 0.5% of total runtime.
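
To see why Fibonacci stresses dynamic spawning, it helps to count tasks. The sketch below assumes the textbook formulation in which every call with n >= 2 spawns two child tasks; the exact structure of the paper's benchmark is not given here, and `spawned_tasks` is a hypothetical helper.

```python
def spawned_tasks(n: int) -> int:
    """Count the tasks created by a naive recursive fib(n), including the root.

    Assumes each call with n >= 2 spawns fib(n - 1) and fib(n - 2) as tasks.
    """
    counts = [1, 1]  # fib(0) and fib(1) are single leaf tasks
    for i in range(2, n + 1):
        counts.append(1 + counts[i - 1] + counts[i - 2])
    return counts[n]


for n in (10, 20, 30):
    tasks = spawned_tasks(n)
    # With one replica per task, the runtime has to schedule twice as many.
    print(n, tasks, 2 * tasks)
```

The task count grows exponentially in n while each task does almost no arithmetic, so nearly all of the running time is runtime bookkeeping, which is exactly where replication and validation costs show up.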

That 0.5% is the key number: it means you can sustain multiple SDCs per execution without significant performance degradation. The approach doesn't require checkpointing or rollback of unrelated work.
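
A back-of-the-envelope estimate makes this concrete. The model below is an assumption rather than something stated in the source: it treats the 0.5% as a fraction of the failure-free replicated running time and assumes the cost of several SDCs adds up linearly. The function name and parameters are hypothetical.

```python
def estimated_runtime(baseline_s: float, replication_factor: float,
                      sdc_count: int, per_sdc_overhead: float = 0.005) -> float:
    """Estimate running time in seconds under a simple additive overhead model.

    baseline_s         -- unprotected running time
    replication_factor -- failure-free slowdown from replication (reported as < 2)
    sdc_count          -- number of SDCs corrected during the run
    per_sdc_overhead   -- correction cost per SDC as a fraction of failure-free time
    """
    failure_free = baseline_s * replication_factor
    return failure_free * (1 + per_sdc_overhead * sdc_count)


# Pessimistic example: the full factor of 2 and ten SDCs in a single run.
print(estimated_runtime(100.0, 2.0, 10))
```

Under these assumptions, ten corrected SDCs add about 5% on top of the replicated run, so the replication factor, not the correction cost, dominates the total.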

If these numbers hold for larger, real-world workloads (finite element solvers, particle simulations, sparse linear algebra), this scheme could become the default SDC protection for AMT systems running on exascale clusters, where silent errors are a near-certainty.

Source: Protecting Futures against Silent Data Corruption -- Efficient Task Replication for Dynamic Data Dependencies
Domain: arxiv.org
