Optimizing checkpointing techniques for machine learning frameworks

Author: Perelló Bacardit, Marc
Contributors: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Cristal Kestelman, Adrián; Bautista Gomez, Leonardo
Language: English
Publication year: 2023
Subject:
Description: While most deep learning frameworks provide mechanisms to checkpoint models, their implementations are naive and assume single-machine training. Although this approach can be adequate at small scale, it becomes progressively inefficient as training scales out. Since large models are often trained at scale in HPC environments, we need checkpoint procedures that make efficient use of the available resources. Several checkpoint techniques and libraries have been designed for HPC environments, in which local storage is leveraged to alleviate the I/O bottleneck of writing to a parallel file system. However, deep learning training has very different data requirements from most HPC applications. To solve this problem, we develop DeepPart, a Python module that provides optimizations to distribute shared checkpoint data across processes, effectively transforming checkpointing into an HPC-like procedure. DeepPart sends the resulting distributed data to FTI, a multi-level HPC checkpoint library that leverages local storage. We implement a heuristic algorithm to distribute the elements of a collection across processes while minimizing the computational cost. We devise a method for automatically choosing the best sub-collection to partition, referred to as the partition candidate, independently of the specific structure of the main collection passed to checkpoint. Additionally, we allow individual elements to be partitioned between two or more processes if our algorithm detects size imbalance. We allow sub-collections to be recursively partitioned, proportionally to their size, in order to efficiently partition non-trivial collection structures. We show that the computational cost of our approach behaves similarly to an embarrassingly parallel workload and achieves close-to-ideal speed-ups with up to 16 nodes and 4 processes per node.
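The size-balanced element distribution described above can be sketched as a greedy longest-processing-time assignment. This is only an illustrative stand-in, not DeepPart's actual heuristic; the function `distribute` and its inputs are hypothetical names chosen for the example.

```python
import heapq

def distribute(sizes, num_procs):
    """Illustrative greedy heuristic (NOT DeepPart's algorithm): assign each
    element (by index) to a process, largest element first, always placing it
    on the currently least-loaded process to keep total sizes balanced."""
    # Min-heap of (assigned_size, process_rank); the least-loaded rank is on top.
    heap = [(0, rank) for rank in range(num_procs)]
    assignment = {}
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, rank = heapq.heappop(heap)
        assignment[idx] = rank
        heapq.heappush(heap, (load + sizes[idx], rank))
    return assignment

# Example: six elements of varying sizes distributed over two processes.
sizes = [50, 30, 20, 10, 10, 5]
assignment = distribute(sizes, 2)
```

With the sizes above, the two processes end up with totals of 65 and 60, close to the even split of 62.5 that a perfect partition would achieve.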
With a model size of 20 GB, we observe overall gains of 5.6x compared to a standard PyTorch checkpoint implementation. Using the BERT-LARGE model, we obtain checkpoint speed-ups of 2.1x without compression and 2.7x with distributed lossless compression compared to a standard PyTorch checkpoint approach. To understand model serialization cost, we perform an analysis using several model data sizes with different numbers of tensors. Our findings show that the serialization cost of a model depends on the relationship between the total model size and the number of tensors, and is optimal when this relationship lies above a lower threshold and below an upper threshold. As such, when distributing model data across processes, it is important to reduce both the total size and the number of tensors on each process proportionally in order to minimize serialization cost. Using a simulator we designed, we show how our data-distribution approach scales with very large model and optimizer structures based on the large variant of the BERT model in different configurations.
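The serialization analysis above can be illustrated with a small experiment: serialize the same total payload split into different numbers of "tensors" and time each configuration. Plain byte buffers stand in for tensors here, and `serialize_time` and the chosen sizes are arbitrary assumptions for the sketch, not the thesis's measurement methodology.

```python
import pickle
import time

def serialize_time(total_bytes, num_chunks):
    """Time pickling of `total_bytes` of data split into `num_chunks`
    distinct byte buffers (stand-ins for tensors). Returns (seconds, blob size)."""
    # Build distinct buffers so pickle serializes each one rather than
    # memoizing repeated references to a single object.
    data = [bytes(total_bytes // num_chunks) for _ in range(num_chunks)]
    start = time.perf_counter()
    blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    return time.perf_counter() - start, len(blob)

# Same total size, varying tensor counts: per-object overhead grows with
# the number of chunks, while very few huge chunks stress large copies.
for num_chunks in (1, 100, 10_000):
    elapsed, size = serialize_time(16 * 1024 * 1024, num_chunks)
    print(f"{num_chunks:>6} chunks: {elapsed * 1e3:.1f} ms, {size} bytes")
```

Sweeping both the total size and the chunk count in this way reveals the kind of lower and upper thresholds the abstract refers to, where per-object overhead dominates on one side and raw data volume on the other.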
Database: OpenAIRE