Description: |
Checkpoint/restart (C/R) makes large-scale parallel jobs resilient against multiple node failures but typically costs considerable time and storage space. Efficient C/R strategies aim for a high level of fault tolerance while keeping the associated I/O and computation low. By combining XOR and partner checkpointing, two relatively weak C/R strategies, we develop and evaluate a stable, scalable, and fast C/R approach (including initialization, checkpointing, version consensus, and recovery mechanisms) that outperforms other C/R methods, such as Reed-Solomon checkpointing, in terms of stability and performance.