Distributed Sign Momentum with Local Steps for Training Transformers

Authors:	Yu, Shuhua; Zhou, Ding; Xie, Cong; Xu, An; Zhang, Zhi; Liu, Xin; Kar, Soummya
Year of publication:	2024
Subject:	Computer Science - Machine Learning
Document type:	Working Paper
Description:	Pre-training Transformer models is resource-intensive, and recent studies have shown that sign momentum is an efficient technique for training large-scale deep learning models, particularly Transformers. However, its application in distributed training or federated learning remains underexplored. This paper investigates a novel communication-efficient distributed sign momentum method with local updates. Our proposed method allows for a broad class of base optimizers for local updates, and uses sign momentum in global updates, where momentum is generated from differences accumulated during local steps. We evaluate our method on the pre-training of various GPT-2 models, and the empirical results show significant improvement compared to other distributed methods with local updates. Furthermore, by approximating the sign operator with a randomized version that acts as a continuous analog in expectation, we present an $O(1/\sqrt{T})$ convergence rate for one instance of the proposed method for nonconvex smooth functions. Comment: 23 pages, 21 figures
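Illustration:	The description above outlines two ideas: a global sign momentum step driven by differences accumulated over local steps, and a randomized sign operator that is continuous in expectation. The following is a minimal sketch of both on a toy problem, not the authors' algorithm. Every name in it (`global_round`, `run_local_steps`, `randomized_sign`, `run_demo`), the choice of plain gradient descent as base optimizer, the single-`beta` momentum rule, the clipping bound, and the quadratic worker objectives are assumptions made for illustration; this record does not specify the paper's exact update rule.

```python
import random


def sign(value):
    """Deterministic sign: +1, -1, or 0."""
    return (value > 0) - (value < 0)


def randomized_sign(value, bound, rng):
    """Random +1/-1 whose expectation is clip(value, -bound, bound) / bound.

    Assumed construction: the record does not give the paper's operator.
    """
    clipped = max(-bound, min(bound, value))
    prob_plus = 0.5 * (1.0 + clipped / bound)
    return 1 if rng.random() < prob_plus else -1


def run_local_steps(params, grad_fn, local_lr, num_local_steps):
    """Base optimizer for local updates; plain gradient descent is assumed."""
    for _ in range(num_local_steps):
        grads = grad_fn(params)
        params = [p - local_lr * g for p, g in zip(params, grads)]
    return params


def global_round(params, momentum, worker_grad_fns, local_lr,
                 num_local_steps, global_lr, beta):
    """One communication round: local steps, then a global sign momentum step."""
    # Each worker starts from the shared parameters and runs local steps.
    finals = [run_local_steps(list(params), fn, local_lr, num_local_steps)
              for fn in worker_grad_fns]
    num_workers = len(finals)
    # Averaged difference accumulated during the local steps.
    diff = [p - sum(f[i] for f in finals) / num_workers
            for i, p in enumerate(params)]
    # Assumed momentum rule: exponential moving average of the differences.
    momentum = [beta * m + (1.0 - beta) * d for m, d in zip(momentum, diff)]
    # Global update moves each coordinate by a fixed step along -sign(momentum).
    params = [p - global_lr * sign(m) for p, m in zip(params, momentum)]
    return params, momentum


def make_quadratic_grad(center):
    """Gradient of the hypothetical objective 0.5 * ||params - center||^2."""
    return lambda params: [p - c for p, c in zip(params, center)]


def run_demo(num_rounds=400):
    """Two workers with different quadratic objectives (toy data)."""
    worker_grad_fns = [make_quadratic_grad([1.0, -2.0]),
                       make_quadratic_grad([3.0, 0.0])]
    params, momentum = [0.0, 0.0], [0.0, 0.0]
    for _ in range(num_rounds):
        params, momentum = global_round(
            params, momentum, worker_grad_fns,
            local_lr=0.1, num_local_steps=4, global_lr=0.01, beta=0.9)
    return params


if __name__ == "__main__":
    print(run_demo())
```

In this toy setup the average objective is minimized at (2, -1), so `run_demo()` should end in a small neighborhood of that point; because the sign step has fixed magnitude, the iterate oscillates around the minimizer rather than converging exactly. Averaging many draws of `randomized_sign(0.3, 1.0, rng)` should approach 0.3, which is the sense in which such an operator is a continuous analog of the sign in expectation.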
Database:	arXiv
External link:	http://arxiv.org/abs/2411.17866