Conference paper

A Methodology for Soft Errors Detection and Automatic Recovery


Handling faults is a growing concern in HPC: higher error rates, longer detection intervals, and silent faults are expected in the future. In exascale systems, errors are projected to occur several times a day, and their propagation will cause failures ranging from process crashes to results corrupted by undetected errors. In this article, we propose a methodology that improves system reliability against transient faults when running parallel message-passing applications. The proposed solution, based on process replication, aims to help programmers and users of parallel scientific applications achieve reliable executions with correct results. This work characterizes the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability for tolerating transient faults in HPC systems.

Keywords
soft error detection
automatic recovery
system-level checkpoint
user-level checkpoint

This work is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (BY-NC-SA 4.0)