Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud

Authors: Kyeonglok Kim, Hyeonsu Lee, Seungmin Oh, Euiseong Seo
Language: English
Publication year: 2022
Subject:
Source: IEEE Access, Vol 10, Pp 68468-68481 (2022)
Document type: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3184692
Description: To cope with the growing scale of deep neural network (DNN) models and training data, cloud computing is increasingly used for distributed DNN training. The amount of available resources in a cloud changes continuously with users’ demands. Although distributed DNN training runs for a long time, from several hours to several days, existing frameworks either provide no dynamic scaling function or incur high scale-in/scale-out overhead. It is therefore difficult to achieve higher performance by adding graphics processing unit (GPU) nodes to a running training cluster, even when surplus GPU resources become available. In addition, the inability to dynamically reconfigure the training cluster prevents the cluster topology from being reformed when it was created sub-optimally. This paper proposes a dynamic scaling technique that adds and removes workers without suspending the ongoing training job. It also proposes a heterogeneity-aware, straggler-proof technique so that, even when the performance of the GPUs in the cloud is uneven, a performance benefit can be guaranteed by adding the surplus resources. When scaling out an existing cluster from five workers to ten, the proposed scheme improved throughput by up to a factor of 17.52 compared to the existing checkpoint-based scheme. Furthermore, training continued at 95.52% of the maximum performance during scaling, whereas Elastic Horovod, which supports dynamic scaling, stopped training for 841 seconds. Finally, even when GPUs of different performance were mixed, the error between the determined batch size and the optimal batch size was 3.37% on average.
Database: Directory of Open Access Journals