Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart

Authors: Masoud Gholami, Florian Schintke
Year of publication: 2021
Subject:
Source: IPDPS
DOI: 10.1109/ipdps49936.2021.00036
Description: Checkpoint/restart (C/R) makes large-scale parallel jobs resilient against multiple node failures but typically takes considerable time and storage space. Efficient C/R strategies try to achieve high levels of fault tolerance while keeping the involved I/O and computation low. By combining XOR and partner checkpointing, two relatively weak C/R strategies, we develop and evaluate a stable, scalable, and fast C/R approach (including initialization, checkpointing, version consensus, and recovery mechanisms) that outperforms other C/R methods such as Reed-Solomon checkpointing in terms of stability and performance.
Database: OpenAIRE