A Methodology for Soft Errors Detection and Automatic Recovery

Montezanti, Diego; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs, Dolores; Luque, Emilio

Conference paper

Embargoed


Publication date

2017

Place of development

Instituto de Investigación en Informática

CIC center

Instituto de Investigación en Informática

Event

International Conference on High Performance Computing & Simulation HPCS (Genoa, 2017)


Language

English

Subject

Engineering and Technology

Length

8 p.

HDL 11746/8584

DOI 10.1109/HPCS.2017.71

Downloads

Full document (273.63 KB)


Abstract

Handling faults is a growing concern in HPC: higher error rates, longer detection intervals, and silent faults are expected in the future. In exascale systems, errors are projected to occur several times a day and to propagate, with consequences ranging from process crashes to results corrupted by undetected errors. In this article, we propose a methodology that improves system reliability against transient faults when running parallel message-passing applications. The proposed solution, based on process replication, aims to help programmers and users of parallel scientific applications achieve reliable executions with correct results. This work characterizes the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability for tolerating transient faults in HPC systems.

Keywords

soft error detection

automatic recovery

system-level checkpoint

user-level checkpoint

This work is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (BY-NC-SA 4.0)
