Towards Aggregated Asynchronous Checkpointing
Author: | Gossman, Mikaila J., Nicolae, Bogdan, Calhoun, Jon C., Cappello, Franck, Smith, Melissa C. |
---|---|
Year of publication: | 2021 |
Subject: | |
Document type: | Working Paper |
Description: | High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application scientists for their ability to leverage fast node-local storage and flush independently to stable, external storage (e.g., parallel file systems) in the background. Currently, VELOC adopts a one-file-per-process flush strategy, which results in a large number of files being written to external storage, thereby overwhelming metadata servers and making it difficult to transfer and access checkpoints as a whole. This paper discusses the viability and challenges of designing aggregation techniques for asynchronous multi-level checkpointing. To this end, we implement two aggregation strategies, study their limitations, and propose a new aggregation strategy designed specifically for asynchronous multi-level checkpointing. Comment: Accepted submission to the SuperCheck Workshop at the SuperComputing Conference held in St. Louis, MO, November 14-19, 2021 (SC'21) |
Database: | arXiv |
External link: |
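
The description contrasts a one-file-per-process flush with aggregation into fewer files. As a rough illustration of the general idea only, the sketch below packs several per-process checkpoint buffers into one shared file, placing each buffer at an offset computed by an exclusive prefix sum over the buffer sizes. This is not VELOC's API and not necessarily either of the strategies studied in the paper; the function names (`exclusive_prefix_offsets`, `write_aggregated`, `read_rank`) are hypothetical, and the "ranks" are simulated by a plain loop rather than by real parallel processes.

```python
import os
import tempfile


def exclusive_prefix_offsets(sizes):
    """Return, for each rank, the byte offset where its data starts."""
    offsets, total = [], 0
    for size in sizes:
        offsets.append(total)
        total += size
    return offsets


def write_aggregated(path, chunks):
    """Write every rank's checkpoint bytes into one shared file.

    Assumption: in a real run each rank would write its own region
    independently; here a loop stands in for the ranks.
    """
    sizes = [len(chunk) for chunk in chunks]
    offsets = exclusive_prefix_offsets(sizes)
    # Pre-size the file so every region exists before it is written.
    with open(path, "wb") as f:
        f.truncate(sum(sizes))
    with open(path, "r+b") as f:
        for offset, chunk in zip(offsets, chunks):
            f.seek(offset)
            f.write(chunk)
    return offsets, sizes


def read_rank(path, offsets, sizes, rank):
    """Read back a single rank's checkpoint from the aggregated file."""
    with open(path, "rb") as f:
        f.seek(offsets[rank])
        return f.read(sizes[rank])


if __name__ == "__main__":
    # Simulated per-process checkpoints of different sizes.
    chunks = [b"aaaa", b"bb", b"", b"cccccc"]
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "ckpt.agg")
        offsets, sizes = write_aggregated(shared, chunks)
        restored = [read_rank(shared, offsets, sizes, r) for r in range(len(chunks))]
        print(offsets, restored == chunks)
```

With this layout, the storage system sees one file instead of one per process, at the cost of having to agree on offsets before writing; the sketch says nothing about how that agreement interacts with asynchronous background flushing, which is the question the paper addresses.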