Showing 1 - 10 of 21 for search: '"RASHIDI, SAEED"'
Distributed Deep Neural Network (DNN) training is a technique to reduce training overhead by distributing the training tasks across multiple accelerators according to a parallelization strategy. However, high-performance compute and interconnects ...
External link:
http://arxiv.org/abs/2406.19580
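The snippet above refers to parallelization strategies for spreading training across accelerators. As a minimal sketch only, assuming the simplest such strategy (data parallelism) and made-up gradient values, and not code from the paper:

    # Data parallelism: each accelerator computes gradients on its own data shard,
    # then every accelerator applies the same averaged update (a simulated all-reduce).
    def allreduce_average(per_worker_grads):
        num_workers = len(per_worker_grads)
        return [sum(vals) / num_workers for vals in zip(*per_worker_grads)]

    # Hypothetical gradients from 4 accelerators for a 3-parameter model.
    grads = [[0.1, -0.2, 0.3],
             [0.0, -0.1, 0.4],
             [0.2, -0.3, 0.2],
             [0.1, -0.2, 0.3]]
    print([round(g, 3) for g in allreduce_average(grads)])  # [0.1, -0.2, 0.3]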
Author:
Sridharan, Srinivas, Heo, Taekyung, Feng, Louis, Wang, Zhaodong, Bergeron, Matt, Fu, Wenyin, Zheng, Shengbao, Coutinho, Brian, Rashidi, Saeed, Man, Changhai, Krishna, Tushar
Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g., MLPerf, play an essential role in enabling fair comparison across different ...
External link:
http://arxiv.org/abs/2305.14516
Author:
Won, William, Heo, Taekyung, Rashidi, Saeed, Sridharan, Srinivas, Srinivasan, Sudarshan, Krishna, Tushar
As deep learning models and input data scale at an unprecedented rate, moving to distributed training platforms becomes inevitable to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale ...
External link:
http://arxiv.org/abs/2303.14006
Author:
Kadiyala, Divya Kiran, Rashidi, Saeed, Heo, Taekyung, Bambhaniya, Abhimanyu Rajeshkumar, Krishna, Tushar, Daglis, Alexandros
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization, so as to amortize their steep cost, is a challenging task requiring ...
External link:
http://arxiv.org/abs/2211.16648
Author:
Khan, Tarannum, Rashidi, Saeed, Sridharan, Srinivas, Shurpali, Pallavi, Akella, Aditya, Krishna, Tushar
RDMA over Converged Ethernet (RoCE) has gained significant traction in datacenter networks due to its compatibility with conventional Ethernet-based fabrics. However, the RDMA protocol is efficient only on (nearly) lossless networks, emphasizing the ...
External link:
http://arxiv.org/abs/2207.10898
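The RoCE snippet above hinges on RDMA's need for a (nearly) lossless fabric: RoCE NICs traditionally recover from loss with go-back-N, so a single drop can force retransmission of up to a full window of in-flight packets. A rough first-order illustration only, with an invented window size and loss rate, and not the paper's model:

    # Rough first-order estimate of wasted traffic under go-back-N loss recovery:
    # each dropped packet can trigger retransmission of up to a full window.
    def goback_n_overhead(loss_rate, window_packets):
        return loss_rate * window_packets  # extra packets sent per delivered packet, to first order

    print(goback_n_overhead(0.001, 256))   # ~0.26 extra packets per delivered packet at 0.1% loss
    print(goback_n_overhead(0.0, 256))     # 0.0: a lossless fabric avoids the waste entirely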
Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models
Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activations ...
External link:
http://arxiv.org/abs/2110.04478
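The Themis snippet above points at the communication overhead of synchronizing gradients across NPUs. As a back-of-the-envelope sketch using the standard ring all-reduce bandwidth model (not Themis's actual bandwidth-aware scheduling policy; the sizes and link speeds below are made up):

    # Time to all-reduce S bytes of gradients across p NPUs over links of B bytes/s,
    # using the classic ring all-reduce volume 2*(p-1)/p * S and ignoring latency terms.
    def ring_allreduce_time(num_npus, grad_bytes, link_bytes_per_s):
        p = num_npus
        volume = 2 * (p - 1) / p * grad_bytes
        return volume / link_bytes_per_s

    print(ring_allreduce_time(8, 1e9, 100e9))  # 0.0175 s to synchronize 1 GB across 8 NPUs on 100 GB/s links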
Published in:
Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '24)
As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes at the expense of increased communication overhead due to the ...
External link:
http://arxiv.org/abs/2109.11762
Using multiple nodes and parallel computing algorithms has become a principal tool for improving the training and execution times of deep neural networks, as well as for effective collective intelligence in sensor networks. In this paper, we consider the parallel ...
External link:
http://arxiv.org/abs/2008.08289
Author:
Rashidi, Saeed, Denton, Matthew, Sridharan, Srinivas, Srinivasan, Sudarshan, Suresh, Amoghavarsha, Ni, Jade, Krishna, Tushar
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPUs/TPUs) via fast, customized interconnects offering hundreds of GB/s of bandwidth. However, as we identify in this work, driving this bandwidth is ...
External link:
http://arxiv.org/abs/2007.00156
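The snippet above concerns driving interconnect bandwidth during training; one common lever is overlapping each layer's gradient communication with the remaining backward compute. A minimal sketch with stand-in sleeps rather than real kernels or collectives, not the paper's mechanism:

    import threading, time

    def communicate(layer):
        time.sleep(0.05)                      # stand-in for an all-reduce on the interconnect
        print(f"all-reduce for layer {layer} done")

    def backprop(layer):
        time.sleep(0.05)                      # stand-in for the layer's backward compute
        print(f"backward pass for layer {layer} done")

    threads = []
    for layer in reversed(range(4)):          # walk layers back to front
        backprop(layer)
        t = threading.Thread(target=communicate, args=(layer,))
        t.start()                             # this layer's all-reduce overlaps the next layer's compute
        threads.append(t)
    for t in threads:
        t.join()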
Author:
Rashidi, Saeed (saeed.rashidi@gatech.edu), Jalili, Majid (majid@utexas.edu), Sarbazi-Azad, Hamid (azad@ipm.ir)
Published in:
ACM Computing Surveys, Vol. 52, Issue 4 (July 2020), pp. 1-38.