Showing 1 - 10 of 30
for the query: '"Sharify, Sayeh"'
Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to miti
External link:
http://arxiv.org/abs/2405.07135
Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we
External link:
http://arxiv.org/abs/2404.09336
Quantization is commonly used to compress and accelerate deep neural networks. Assigning the same bit-width to all layers leads to large accuracy degradation at low precision and is wasteful at high-precision settings. Mixed-precision qu
External link:
http://arxiv.org/abs/2307.05657
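The abstract above contrasts uniform bit-widths with mixed precision. As an illustrative sketch only (not the paper's method), the snippet below quantizes each layer with its own hypothetical bit-width and shows how fewer bits cost more accuracy; the layer names and bit assignments are assumptions for the example.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric uniform quantization of a tensor to a given bit-width.

    Illustrative only: a real mixed-precision method searches for the
    per-layer bit-widths, which here are simply supplied by hand.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.abs(x).max() / qmax        # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                      # dequantized values

# Mixed precision: each layer may receive a different bit-width.
rng = np.random.default_rng(0)
layers = {"conv1": rng.standard_normal(64), "fc": rng.standard_normal(64)}
bit_widths = {"conv1": 8, "fc": 4}        # hypothetical assignment
for name, w in layers.items():
    err = np.abs(w - quantize_uniform(w, bit_widths[name])).max()
```

Lowering a layer's bit-width raises its worst-case quantization error, which is why a uniform low-precision setting degrades accuracy while a uniform high-precision one wastes bits.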
We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level, the amount of work perform
External link:
http://arxiv.org/abs/1805.04513
Author:
Delmas, Alberto, Sharify, Sayeh, Judd, Patrick, Siu, Kevin, Nikolic, Milos, Moshovos, Andreas
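The entry above decomposes multiplications to the bit level to expose ineffectual work. A minimal sketch of that idea, assuming 16-bit fixed-point values: only the nonzero bit positions of an operand contribute terms to a shift-and-add multiplication, so counting one bits bounds the effectual work. This is an illustration of the general principle, not the paper's hardware design.

```python
def effectual_bits(value, bits=16):
    """Count the nonzero bits of a fixed-point value.

    When a multiplication is decomposed into per-bit shift-and-add
    terms, only these positions perform actual work; zero bits are
    ineffectual. Sketch only, not the paper's mechanism.
    """
    return bin(value & ((1 << bits) - 1)).count("1")

# A bit-parallel multiplier always pays for all 16 bit positions;
# counting effectual bits exposes the work that could be skipped.
weights = [0b0000000000000101, 0b0000001000000000, 0]
effectual = sum(effectual_bits(w) for w in weights)   # 2 + 1 + 0
bit_parallel = len(weights) * 16                      # fixed cost
```

The gap between `effectual` and `bit_parallel` is the ineffectual work the abstract refers to, and it exists even when no whole weight or activation is zero.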
We show that selecting a single data type (precision) for all values in Deep Neural Networks, even if that data type is different per layer, amounts to worst case design. Much shorter data types can be used if we target the common case by adjusting t
External link:
http://arxiv.org/abs/1804.06732
Author:
Delmas, Alberto, Judd, Patrick, Stuart, Dylan Malone, Poulos, Zissis, Mahmoud, Mostafa, Sharify, Sayeh, Nikolic, Milos, Moshovos, Andreas
We show that, during inference with Convolutional Neural Networks (CNNs), more than 2x to 8x ineffectual work can be exposed if, instead of targeting those weights and activations that are zero, we target different combinations of value stream proper
External link:
http://arxiv.org/abs/1803.03688
Tartan (TRT), a hardware accelerator for inference with Deep Neural Networks (DNNs), is presented and evaluated on Convolutional Neural Networks. TRT exploits the variable per-layer precision requirements of DNNs to deliver execution time that is pro
External link:
http://arxiv.org/abs/1707.09068
Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs), is presented. In LM, every bit of data precision that can be saved translates to proportional performance gains. Specifically, for convolutional layers, LM's execution
External link:
http://arxiv.org/abs/1706.07853
Stripes is a Deep Neural Network (DNN) accelerator that uses bit-serial computation to offer performance that is proportional to the fixed-point precision of the activation values. The fixed-point precisions are determined a priori using profiling an
External link:
http://arxiv.org/abs/1706.00504
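The Stripes abstract describes bit-serial computation whose runtime scales with activation precision. The sketch below models that behavior in software, assuming unsigned fixed-point activations: one activation bit is consumed per "cycle", so the cycle count equals the chosen precision rather than a fixed word width. This illustrates the general bit-serial idea, not the Stripes microarchitecture itself.

```python
def bit_serial_dot(weights, activations, act_bits):
    """Bit-serial inner product over unsigned fixed-point activations.

    Each cycle consumes one bit position of every activation and adds
    the correspondingly shifted weight, so cycles == act_bits: shaving
    a bit of precision directly shaves a cycle. Sketch only.
    """
    acc = 0
    cycles = 0
    for b in range(act_bits):            # one activation bit per cycle
        for w, a in zip(weights, activations):
            if (a >> b) & 1:             # add shifted weight if bit set
                acc += w << b
        cycles += 1
    return acc, cycles
```

For example, `bit_serial_dot([3, 1], [2, 5], act_bits=3)` yields the exact dot product 3*2 + 1*5 = 11 in 3 cycles; with a profiled 3-bit precision instead of a fixed 16-bit word, runtime drops proportionally.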
We discuss several modifications and extensions to the previously proposed Cnvlutin (CNV) accelerator for convolutional and fully-connected layers of Deep Learning Networks. We first describe different encodings of the activations that are deemed inef
External link:
http://arxiv.org/abs/1705.00125