Showing 1 - 7 of 7
for search: '"Pati, Suchita"'
Modern accelerators like GPUs are increasingly executing independent operations concurrently to improve the device's compute utilization. However, effectively harnessing it on GPUs for important primitives such as general matrix multiplications (GEMM…
External link:
http://arxiv.org/abs/2409.02227
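As a rough illustration of the concurrency the snippet above refers to (not code from the paper), the sketch below issues two independent GEMMs on separate CUDA streams in PyTorch so the hardware may overlap them when a single GEMM does not fill the device; the matrix sizes and stream setup are illustrative assumptions.

# Minimal sketch: two independent GEMMs on separate CUDA streams.
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

a1 = torch.randn(1024, 1024, device=device)
b1 = torch.randn(1024, 1024, device=device)
a2 = torch.randn(1024, 1024, device=device)
b2 = torch.randn(1024, 1024, device=device)

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Each GEMM is enqueued on its own stream; the GPU is free to
# execute them concurrently if resources allow.
with torch.cuda.stream(s1):
    c1 = a1 @ b1
with torch.cuda.stream(s2):
    c2 = a2 @ b2

torch.cuda.synchronize()  # wait for both streams before using c1 and c2

Whether the two GEMMs actually overlap depends on kernel occupancy and scheduling, which is the kind of question a profiler run would answer.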
Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed t…
External link:
http://arxiv.org/abs/2401.16677
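As a generic illustration of hiding communication behind computation (not the specific technique of the paper above), the sketch below starts an asynchronous all-reduce with torch.distributed and performs an unrelated matrix multiply while the collective is in flight; the NCCL backend, process-group setup, and tensor shapes are assumptions for the example.

# Minimal sketch: overlap an async all-reduce with independent compute.
import torch
import torch.distributed as dist

if not dist.is_initialized():
    # Assumes launch via torchrun with the usual rendezvous env vars.
    dist.init_process_group("nccl")

grad = torch.randn(4096, 4096, device="cuda")
x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

# Launch the collective without blocking ...
work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)

# ... and do independent computation while the communication is in flight.
y = x @ w

work.wait()  # ensure the reduced tensor is ready before it is consumed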
Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-p…
External link:
http://arxiv.org/abs/2311.05034
Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is im…
External link:
http://arxiv.org/abs/2302.02825
Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging language p…
External link:
http://arxiv.org/abs/2104.08335
The ubiquity of deep neural networks (DNNs) continues to rise, making them a crucial application class for hardware optimizations. However, detailed profiling and characterization of DNN training remains difficult as these applications often run for…
External link:
http://arxiv.org/abs/2007.10459
Author:
Lew, Jonathan, Shah, Deval, Pati, Suchita, Cattell, Shaylin, Zhang, Mengchi, Sandhupatla, Amruth, Ng, Christopher, Goli, Negar, Sinclair, Matthew D., Rogers, Timothy G., Aamodt, Tor
Most deep neural networks deployed today are trained using GPUs via high-level frameworks such as TensorFlow and PyTorch. This paper describes changes we made to the GPGPU-Sim simulator to enable it to run PyTorch by running PTX kernels included in N…
External link:
http://arxiv.org/abs/1811.08933