A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks

Author: Adrián Castelló, Mar Catalan, Sergio Barrachina, Jose I. Mestre, Manuel F. Dolz
Year of publication: 2021
Source: IPDPS Workshops
DOI: 10.1109/ipdpsw52791.2021.00110
Description: We present PyDTNN, a framework for training deep neural networks (DNNs) on clusters of computers that has been designed as a research-oriented tool with a low learning curve. Our parallel training framework offers a set of functionalities that cover several must-have features for advanced deep learning (DL) software: 1) it is developed in Python in order to offer an accessible entry point for newcomers; 2) it is extensible, allowing users to prototype new research ideas without requiring them to deal with complex software stacks; and 3) it delivers high parallel performance, exploiting MPI via mpi4py/NCCL for communication, and NumPy, cuDNN, and cuBLAS for computation. This paper provides practical evidence that PyDTNN attains accuracy and parallel performance similar to those exhibited by Google's TensorFlow (TF), though we recognize that PyDTNN cannot compete with a production-level framework such as TF or PyTorch in terms of maturity and functionality. Instead, PyDTNN is designed as an accessible and customizable tool for prototyping ideas related to the distributed training of DNN models on clusters.
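
To make the communication scheme mentioned in the description concrete, the following is a minimal sketch of the data-parallel pattern that MPI-based distributed training typically relies on: each rank computes gradients on its local mini-batch shard, then the gradients are averaged across all ranks with an allreduce via mpi4py. This is not PyDTNN's actual API; the helper compute_gradients and the parameter list params are hypothetical placeholders used only for illustration.

    # Sketch of data-parallel SGD with mpi4py: local backprop on each
    # rank, followed by a gradient allreduce. Hypothetical placeholders:
    # `params` (list of NumPy parameter arrays) and `compute_gradients`.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    def train_step(params, local_batch, compute_gradients, lr=0.01):
        """One data-parallel SGD step: local gradients + MPI allreduce."""
        grads = compute_gradients(params, local_batch)  # local ndarrays
        for g in grads:
            # Sum gradients across all ranks in place, then average.
            comm.Allreduce(MPI.IN_PLACE, g, op=MPI.SUM)
            g /= size
        # Every rank applies the same averaged update, so the model
        # replicas stay synchronized without broadcasting parameters.
        return [p - lr * g for p, g in zip(params, grads)]

Because every rank starts from identical parameters and applies the identical averaged gradient, no extra synchronization step is needed per iteration; this is the standard synchronous data-parallel scheme that frameworks in this space commonly implement.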
Database: OpenAIRE