A Methodology for Soft Errors Detection and Automatic Recovery

cic.institucionOrigen: Instituto de Investigación en Informática [es]
cic.isFulltext: true [es]
cic.isPeerReviewed: true [es]
cic.lugarDesarrollo: Instituto de Investigación en Informática [es]
cic.version: info:eu-repo/semantics/publishedVersion [es]
dc.date.accessioned: 2018-11-14T11:50:55Z
dc.date.available: 2018-11-14T11:50:55Z
dc.identifier.uri: https://digital.cic.gba.gob.ar/handle/11746/8584
dc.title: A Methodology for Soft Errors Detection and Automatic Recovery [en]
dc.type: Conference paper [es]
dcterms.abstract: Fault handling is a growing concern in HPC: higher error rates, longer detection intervals, and silent faults are expected in the future. In exascale systems, errors are projected to occur several times a day, and their propagation will cause failures ranging from process crashes to results corrupted by undetected errors. In this article, we propose a methodology that improves system reliability against transient faults when running parallel message-passing applications. The proposed solution, based on process replication, aims to help programmers and users of parallel scientific applications achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability for tolerating transient faults in HPC systems. [en]
dcterms.creator.author: Montezanti, Diego [es]
dcterms.creator.author: De Giusti, Armando Eduardo [es]
dcterms.creator.author: Naiouf, Marcelo [es]
dcterms.creator.author: Villamayor, Jorge [es]
dcterms.creator.author: Rexachs, Dolores [es]
dcterms.creator.author: Luque, Emilio [es]
dcterms.extent: 8 p. [es]
dcterms.identifier.other: DOI:10.1109/HPCS.2017.71 [es]
dcterms.identifier.url: Full resource [es]
dcterms.isPartOf.issue: International Conference on High Performance Computing & Simulation HPCS (Genoa, 2017) [es]
dcterms.isPartOf.series: International Conference on High Performance Computing & Simulation HPCS (Genoa, 2017) [es]
dcterms.issued: 2017
dcterms.language: English [es]
dcterms.license: Attribution-NonCommercial-ShareAlike 4.0 International (BY-NC-SA 4.0) [es]
dcterms.subject: soft error detection [en]
dcterms.subject: automatic recovery [en]
dcterms.subject: system-level checkpoint [en]
dcterms.subject: user-level checkpoint [en]
dcterms.subject.materia: Engineering and Technology [es]

Files

Original bundle
Name: Montezanti - A Methodology for Soft Errors Detection a.pdf-PDFA.pdf
Size: 273.63 KB
Format: Adobe Portable Document Format
Description: Full document