Výsledky vyhledávání - "Sharify, Sayeh"

Report

Combining multiple post-training techniques to achieve most efficient quantized LLMs

Autor: Sharify, Sayeh, Xu, Zifei, Yazar, Wanzin, Wang, Xin

Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to miti

Externí odkaz: http://arxiv.org/abs/2405.07135

Zobrazit plný text záznamu

Report

Self-Selected Attention Span for Accelerating Large Language Model Inference

Autor: Jin, Tian, Yazar, Wanzin, Xu, Zifei, Sharify, Sayeh, Wang, Xin

Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we

Externí odkaz: http://arxiv.org/abs/2404.09336

Zobrazit plný text záznamu

Report

Mixed-Precision Quantization with Cross-Layer Dependencies

Autor: Deng, Zihao, Wang, Xin, Sharify, Sayeh, Orshansky, Michael

Quantization is commonly used to compress and accelerate deep neural networks. Quantization assigning the same bit-width to all layers leads to large accuracy degradation at low precision and is wasteful at high precision settings. Mixed-precision qu

Externí odkaz: http://arxiv.org/abs/2307.05657

Zobrazit plný text záznamu

Report

Laconic Deep Learning Computing

Autor: Sharify, Sayeh, Mahmoud, Mostafa, Lascorz, Alberto Delmas, Nikolic, Milos, Moshovos, Andreas

We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models and without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level the amount of work perform

Externí odkaz: http://arxiv.org/abs/1805.04513

Zobrazit plný text záznamu

Report

DPRed: Making Typical Activation and Weight Values Matter In Deep Learning Computing

Autor: Delmas, Alberto, Sharify, Sayeh, Judd, Patrick, Siu, Kevin, Nikolic, Milos, Moshovos, Andreas

We show that selecting a single data type (precision) for all values in Deep Neural Networks, even if that data type is different per layer, amounts to worst case design. Much shorter data types can be used if we target the common case by adjusting t

Externí odkaz: http://arxiv.org/abs/1804.06732

Zobrazit plný text záznamu

Report

Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How

Autor: Delmas, Alberto, Judd, Patrick, Stuart, Dylan Malone, Poulos, Zissis, Mahmoud, Mostafa, Sharify, Sayeh, Nikolic, Milos, Moshovos, Andreas

We show that, during inference with Convolutional Neural Networks (CNNs), more than 2x to $8x ineffectual work can be exposed if instead of targeting those weights and activations that are zero, we target different combinations of value stream proper

Externí odkaz: http://arxiv.org/abs/1803.03688

Zobrazit plný text záznamu

Report

Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability

Autor: Delmas, Alberto, Sharify, Sayeh, Judd, Patrick, Moshovos, Andreas

Tartan (TRT), a hardware accelerator for inference with Deep Neural Networks (DNNs), is presented and evaluated on Convolutional Neural Networks. TRT exploits the variable per layer precision requirements of DNNs to deliver execution time that is pro

Externí odkaz: http://arxiv.org/abs/1707.09068

Zobrazit plný text záznamu

Report

Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

Autor: Sharify, Sayeh, Lascorz, Alberto Delmas, Siu, Kevin, Judd, Patrick, Moshovos, Andreas

Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs) is presented. In LM every bit of data precision that can be saved translates to proportional performance gains. Specifically, for convolutional layers LM's execution

Externí odkaz: http://arxiv.org/abs/1706.07853

Zobrazit plný text záznamu

Report

Dynamic Stripes: Exploiting the Dynamic Precision Requirements of Activation Values in Neural Networks

Autor: Delmas, Alberto, Judd, Patrick, Sharify, Sayeh, Moshovos, Andreas

Stripes is a Deep Neural Network (DNN) accelerator that uses bit-serial computation to offer performance that is proportional to the fixed-point precision of the activation values. The fixed-point precisions are determined a priori using profiling an

Externí odkaz: http://arxiv.org/abs/1706.00504

Zobrazit plný text záznamu

Report

Cnvlutin2: Ineffectual-Activation-and-Weight-Free Deep Neural Network Computing

Autor: Judd, Patrick, Delmas, Alberto, Sharify, Sayeh, Moshovos, Andreas

We discuss several modifications and extensions over the previous proposed Cnvlutin (CNV) accelerator for convolutional and fully-connected layers of Deep Learning Network. We first describe different encodings of the activations that are deemed inef

Externí odkaz: http://arxiv.org/abs/1705.00125

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání