Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart

Authors:	Masoud Gholami, Florian Schintke
Publication year:	2021
Subject:	Computer science; Multiple node; Computation; Scalability; Stability (learning theory); Initialization; Fault tolerance; Limiting; Parallel computing
Source:	IPDPS
DOI:	10.1109/ipdps49936.2021.00036
Description:	Checkpoint/restart (C/R) makes large-scale parallel jobs resilient against multiple node failures but typically takes considerable time and storage space. Efficient C/R strategies try to achieve high levels of fault tolerance while keeping the involved I/O and computation low. By combining XOR and partner checkpointing, two relatively weak C/R strategies, we develop and evaluate a stable, scalable, and fast C/R approach (including initialization, checkpointing, version consensus, and recovery mechanisms) that outperforms other C/R methods such as Reed-Solomon checkpointing in terms of stability and performance.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::7457b20f0f917c148fbb9b082e4710e8 https://doi.org/10.1109/ipdps49936.2021.00036