swFLOW: A Dataflow Deep Learning Framework on Sunway TaihuLight Supercomputer

Authors: Jose Monsalve Diaz, Mingfan Li, Guang R. Gao, Han Lin, Lin Zeng, Hong An
Year of publication: 2019
Subject:
Source: HPCC/SmartCity/DSS
DOI: 10.1109/hpcc/smartcity/dss.2019.00345
Description: Deep learning is widely used in many modern fields, and a number of deep learning models and software frameworks have been proposed. However, it is still very difficult to process deep learning tasks efficiently on traditional high performance computing (HPC) systems with specialized architectures, such as Sunway TaihuLight. In this paper, we propose swFLOW, a TensorFlow-based dataflow deep learning framework for Sunway TaihuLight. Based on performance analysis of a convolutional neural network (CNN), we optimize the convolution layer and reduce data layout transpose operations, achieving a 10.42x speedup over the single management processing element (MPE) version. For distributed training, we use the elastic averaging stochastic gradient descent (EASGD) algorithm to reduce communication, and we use data prefetching to keep data fetching from becoming a performance bottleneck. On 512 processes, we obtain a parallel efficiency of 81.01% with communication period τ = 8. Because the maximum executable batch size is limited, the current performance of swFLOW is far from optimal. Further optimization with techniques such as remote direct memory access (RDMA) and model parallelism is needed.
Database: OpenAIRE
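
The description mentions EASGD with a communication period τ. As an illustration only, the following minimal sketch shows how a synchronous form of elastic averaging can reduce communication: each worker takes local SGD steps and exchanges with a shared center variable only every τ steps. It is not the swFLOW implementation. The function name `easgd`, the toy quadratic objective, the noise model, and all hyperparameter values other than τ = 8 are assumptions made for this example.

```python
import random


def easgd(num_workers=4, steps=200, tau=8, lr=0.1, alpha=0.1,
          target=3.0, seed=0):
    """Toy synchronous EASGD on f(x) = 0.5 * (x - target)**2.

    All names and values here are hypothetical; only tau = 8 echoes the
    communication period quoted in the description.
    """
    rng = random.Random(seed)
    # Each worker keeps its own copy of the (scalar) parameter.
    workers = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    center = 0.0      # shared center variable
    comm_rounds = 0   # number of steps on which workers communicated

    for t in range(1, steps + 1):
        # Local SGD step with a noisy gradient; no communication needed.
        for i in range(num_workers):
            grad = (workers[i] - target) + rng.gauss(0.0, 0.1)
            workers[i] -= lr * grad

        # Elastic exchange only every tau steps.
        if t % tau == 0:
            diffs = [alpha * (w - center) for w in workers]
            for i in range(num_workers):
                workers[i] -= diffs[i]   # pull each worker toward the center
            center += sum(diffs)         # pull the center toward the workers
            comm_rounds += 1

    return center, workers, comm_rounds


if __name__ == "__main__":
    center, workers, comm_rounds = easgd()
    print("center:", center)
    print("communication rounds:", comm_rounds)
```

With τ = 8, the workers in this sketch communicate on one step out of every eight, which is the sense in which a larger communication period trades synchronization frequency for lower communication cost.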