
Dynamic Task Replication Corrects Silent Data Corruptions at About 0.5% of Runtime Each

Silent data corruptions threaten correctness at scale; a new replication scheme in the ItoyoriFBC runtime keeps failure-free running time below twice the unprotected baseline and correction cost at about 0.5% of runtime per corruption.

Tags: itoyorifbc, silent data corruption, async many task, task replication, h matrix lu decomposition, runtime systems

Replicating every task to guard against silent data corruptions (SDCs) would naively at least double running time, and dynamic data dependencies make the comparison step harder still. In preliminary experiments, the ItoyoriFBC runtime nevertheless stays below a 2x failure-free slowdown and spends only about 0.5% of runtime correcting each SDC.

Why Static Replication Falls Short for Dynamic Tasks

Most existing SDC protection schemes assume static tasks and fixed dependencies. That works for bulk-synchronous parallelism, but fails for Asynchronous Many-Task (AMT) runtimes, where tasks spawn at runtime, communicate through C++11-like promises/futures, touch (read) those futures conditionally, and are balanced by work stealing across clusters.

Tracking every input and output for comparison in a dynamic DAG is expensive. Naïve full replication doubles compute time and still leaves the problem of when to compare results and how to recover only the affected tasks.

Cross-Validation at the Runtime Boundary Keeps Overhead Low

The authors of the ItoyoriFBC runtime propose a tightly coupled approach: original and replica computations cross-validate all outgoing effects when they interact with the runtime. Instead of comparing final task results, they validate every side effect (each future write, each task spawn) against its replica. This catches corruption early and enables selective recomputation of only the tasks that were actually affected by an SDC.
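
To make the idea concrete, here is a minimal Python sketch of effect-level cross-validation. It is not the ItoyoriFBC API: `EffectLog`, `run_replicated`, `make_body`, and the fault list are hypothetical names invented for illustration. For simplicity it compares the two effect logs after both copies finish, whereas the scheme described above validates each effect as it reaches the runtime.

```python
class EffectLog:
    """Records the outgoing effects of one execution of a task body (hypothetical)."""

    def __init__(self):
        self.effects = []

    def write_future(self, future_id, value):
        self.effects.append(("write", future_id, value))

    def spawn(self, task_name, *args):
        self.effects.append(("spawn", task_name, args))


def run_replicated(body, max_attempts=3):
    """Run `body` as original and replica; commit effects only if both logs agree."""
    for _ in range(max_attempts):
        original, replica = EffectLog(), EffectLog()
        body(original)
        body(replica)
        if original.effects == replica.effects:
            return original.effects  # validated, safe to hand to the runtime
        # Logs differ: an SDC hit one copy, so discard both and re-execute.
    raise RuntimeError("effect logs kept disagreeing")


def make_body(faults):
    """Build a task body; each entry in `faults` flips bits in one execution."""

    def body(log):
        partial = 20 + 22
        if faults:
            partial ^= faults.pop()  # emulated silent bit flip (assumption)
        log.write_future("f0", partial)
        log.spawn("child", partial - 1)

    return body


clean = run_replicated(make_body([]))
repaired = run_replicated(make_body([4]))  # first execution is corrupted
print(clean == repaired)
```

In the second call the corrupted copy writes a different value to the future, the logs disagree, and only this one task body is executed again before anything is committed.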

Only corrupted tasks get recomputed, not entire subtrees. Work stealing already present in the runtime naturally balances the extra replica tasks, turning a cost into an opportunity: more tasks means better load balancing.
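
The following sketch illustrates selective recomputation on a Fibonacci task tree. It runs sequentially and uses a hypothetical `Runtime` class with an injected fault index; the real runtime executes copies in parallel under work stealing, and its recovery policy may differ from the "re-run both copies on mismatch" rule assumed here.

```python
class Runtime:
    """Toy sequential stand-in for a replicating runtime (hypothetical)."""

    def __init__(self, corrupt_at=()):
        self.corrupt_at = set(corrupt_at)  # execution indices that get corrupted
        self.executions = 0                # every copy that was run
        self.tasks = 0                     # logical tasks in the DAG
        self.recomputed = 0                # tasks that had to be re-validated

    def execute(self, body):
        """Run one copy of a task body, optionally flipping a bit in its result."""
        index = self.executions
        self.executions += 1
        result = body()
        if index in self.corrupt_at:
            result ^= 1  # emulated silent bit flip
        return result

    def run_validated(self, body):
        """Run original and replica; repeat only this task until they agree."""
        self.tasks += 1
        original, replica = self.execute(body), self.execute(body)
        while original != replica:
            self.recomputed += 1
            original, replica = self.execute(body), self.execute(body)
        return original


def fib(rt, n):
    """Each call is one task; children are validated before the parent combines them."""
    if n < 2:
        return rt.run_validated(lambda: n)
    a = fib(rt, n - 1)
    b = fib(rt, n - 2)
    return rt.run_validated(lambda: a + b)


rt = Runtime(corrupt_at={40})
result = fib(rt, 10)
print(result, rt.tasks, rt.recomputed, rt.executions)
```

With a single injected fault, one task out of 177 is re-executed and the rest of the tree is untouched, which is the behavior the paragraph above describes.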

Benchmarks: Fibonacci and H-Matrix LU Decomposition

Preliminary experiments using Fibonacci (a stress test for dynamic spawning) and emulated $\mathcal{H}$-matrix LU decomposition (a more realistic hierarchical-matrix linear algebra kernel) quantify the overhead. Failure-free running times increased by less than a factor of two, despite full replication. The correction overhead per SDC hovered around 0.5% of total runtime.

That 0.5% is the key number: it means you can sustain multiple SDCs per execution without significant performance degradation. The approach doesn't require checkpointing or rollback of unrelated work.
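
As a rough back-of-the-envelope check, the reported figures can be plugged into a simple cost model. The model is an assumption, not something stated in the source: it treats correction costs as additive and as a fixed fraction of the failure-free replicated runtime, and `estimated_runtime` is a hypothetical helper.

```python
def estimated_runtime(baseline_s, replication_factor, sdc_count,
                      correction_fraction=0.005):
    """Estimate runtime under replication with `sdc_count` corrected SDCs.

    Assumes each correction adds `correction_fraction` of the failure-free
    replicated runtime, and that corrections do not interact.
    """
    failure_free = baseline_s * replication_factor
    return failure_free * (1 + correction_fraction * sdc_count)


# A 100 s job at the 2x upper bound, hit by 10 SDCs in one execution.
print(estimated_runtime(100.0, 2.0, 10))
```

Under these assumptions, ten corruptions add about 5% on top of the replicated run, so the replication factor rather than the correction cost dominates the total.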

If these numbers hold for larger, real-world workloads such as finite element solvers, particle simulations, and sparse linear algebra, this scheme could become a default choice for SDC protection in AMT systems running on exascale clusters, where silent errors are expected to be common.


Source: Protecting Futures against Silent Data Corruption -- Efficient Task Replication for Dynamic Data Dependencies
Domain: arxiv.org
