Showing 1 - 10 of 673 for the search: '"Abdelfattah, Mohamed"'
Bit-level sparsity methods skip ineffectual zero-bit operations and are typically applicable within bit-serial deep learning accelerators. This type of sparsity at the bit-level is especially interesting because it is both orthogonal and compatible w…
External link:
http://arxiv.org/abs/2409.05227
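The snippet above describes bit-serial execution, where a weight is consumed one bit per cycle and zero bits contribute nothing to the product. A minimal sketch of that idea (not the paper's actual accelerator design) is a shift-add dot product that only spends cycles on set bits:

```python
def bit_serial_dot(activations, weights, bits=8):
    """Dot product where each weight is processed one bit at a time.

    Only the set (nonzero) bits of each weight trigger a shift-add,
    so the cycle count scales with the number of effectual bits
    rather than with the full bit width.
    """
    acc = 0
    cycles = 0
    for a, w in zip(activations, weights):
        for b in range(bits):
            if (w >> b) & 1:       # skip ineffectual zero bits
                acc += a << b      # shift-add for an effectual bit
                cycles += 1
    return acc, cycles

# Sparse-bit weights need far fewer cycles than bits * len(weights).
result, cycles = bit_serial_dot([3, 1, 2], [0b1000, 0b0001, 0b0101])
assert result == 3 * 8 + 1 * 1 + 2 * 5
assert cycles == 4   # only four set bits across all three weights
```

In hardware the same principle lets a bit-serial PE finish early on bit-sparse weights, which is why bit-level sparsity composes with the value-level (whole-zero) sparsity most accelerators already exploit.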
Author:
Chang, Chi-Chih, Lin, Wei-Cheng, Lin, Chien-Yu, Chen, Chong-Yan, Hu, Yu-Fang, Wang, Pei-Shuo, Huang, Ning-Chi, Ceze, Luis, Abdelfattah, Mohamed S., Wu, Kai-Chiang
Post-training KV-Cache compression methods typically either sample a subset of effectual tokens or quantize the data into lower numerical bit width. However, these methods cannot exploit redundancy in the hidden dimension of the KV tensors. This pape…
External link:
http://arxiv.org/abs/2407.21118
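The snippet contrasts token sampling and quantization with compressing the hidden dimension itself. A minimal NumPy sketch of that direction (a generic truncated SVD, not the paper's specific method) caches a low-rank latent per token plus one shared projection:

```python
import numpy as np

def low_rank_compress(kv, rank):
    """Compress a (tokens, hidden) KV block along the hidden dimension.

    Illustrative only: a truncated SVD keeps the top `rank` components,
    exploiting redundancy across the hidden dimension rather than
    dropping tokens or lowering bit width.
    """
    u, s, vt = np.linalg.svd(kv, full_matrices=False)
    latent = u[:, :rank] * s[:rank]   # (tokens, rank) -- what gets cached
    proj = vt[:rank]                  # (rank, hidden) -- shared projection
    return latent, proj

rng = np.random.default_rng(0)
# A low-rank KV block: most energy lives in a few hidden directions.
kv = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
latent, proj = low_rank_compress(kv, rank=8)
assert np.allclose(latent @ proj, kv, atol=1e-6)
```

Caching `latent` instead of `kv` shrinks the per-token footprint from `hidden` to `rank` values, and the approach is orthogonal to quantizing whatever is cached.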
FPGAs offer a flexible platform for accelerating deep neural network (DNN) inference, particularly for non-uniform workloads featuring fine-grained unstructured sparsity and mixed arithmetic precision. To leverage these redundancies, an emerging appr…
External link:
http://arxiv.org/abs/2407.06033
Author:
Akhauri, Yash, AbouElhamayed, Ahmed F, Dotzel, Jordan, Zhang, Zhiru, Rush, Alexander M, Huda, Safeen, Abdelfattah, Mohamed S
The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated efficiency techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs b…
External link:
http://arxiv.org/abs/2406.16635
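Contextual sparsity, as the snippet describes it, means the set of active neurons is chosen per input rather than fixed at compile time. A toy sketch of the common predictor-gated FFN pattern (the matrix `predictor_w` and the `keep` ratio are hypothetical stand-ins, not this paper's components):

```python
import numpy as np

def contextual_ffn(x, w_in, w_out, predictor_w, keep=0.25):
    """FFN layer that only evaluates neurons a cheap predictor marks active.

    `predictor_w` plays the role of a small learned scorer: it ranks the
    hidden neurons for the current input, and only the top `keep`
    fraction is actually computed, so the sparsity pattern is
    input-dependent (contextual) rather than static.
    """
    scores = x @ predictor_w                      # cheap proxy scores
    k = max(1, int(keep * w_in.shape[1]))
    active = np.argsort(scores)[-k:]              # input-dependent subset
    h = np.maximum(x @ w_in[:, active], 0.0)      # compute only active neurons
    return h @ w_out[active]

rng = np.random.default_rng(1)
d, hidden = 16, 64
x = rng.standard_normal(d)
y = contextual_ffn(x, rng.standard_normal((d, hidden)),
                   rng.standard_normal((hidden, d)),
                   rng.standard_normal((d, hidden)))
assert y.shape == (d,)
```

The design tension this exposes is the one the abstract hints at: the predictor must be far cheaper than the neurons it prunes, or the gating overhead eats the savings.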
Author:
Dotzel, Jordan, Chen, Yuzong, Kotb, Bahaa, Prasad, Sushma, Wu, Gang, Li, Sheng, Abdelfattah, Mohamed S., Zhang, Zhiru
The increasing size of large language models (LLMs) traditionally requires low-precision integer formats to meet strict latency and power demands. Yet recently, alternative formats such as Normal Float (NF4) have increased model accuracy at the cost…
External link:
http://arxiv.org/abs/2405.03103
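The integer-versus-NF4 trade-off in the snippet comes down to where a 4-bit format places its 16 representable levels. A small sketch that quantizes weights against two codebooks — the nonuniform levels below are illustrative (denser near zero, in the spirit of NF4), not the exact NF4 table:

```python
import numpy as np

def quantize(x, levels):
    """Absmax-scale x, then snap each entry to its nearest codebook level."""
    scale = np.max(np.abs(x))
    idx = np.abs(x[:, None] / scale - levels[None, :]).argmin(axis=1)
    return levels[idx] * scale, scale

# 16 evenly spaced levels stand in for a 4-bit integer format.
int4_like = np.linspace(-1.0, 1.0, 16)
# A nonuniform 16-level codebook, denser near zero where normally
# distributed weights concentrate (illustrative, not the NF4 table).
t = np.linspace(-1.0, 1.0, 16)
nf4_like = np.sign(t) * t ** 2

w = np.random.default_rng(3).standard_normal(4096) * 0.05
for levels in (int4_like, nf4_like):
    q, scale = quantize(w, levels)
    mse = np.mean((w - q) ** 2)
    assert mse < np.var(w)   # quantization error well under signal power
```

The accuracy/cost tension the abstract mentions follows directly: nonuniform levels fit bell-shaped weights better, but integer levels map onto cheap fixed-point arithmetic while a lookup codebook does not.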
Author:
Dotzel, Jordan, Akhauri, Yash, AbouElhamayed, Ahmed S., Jiang, Carly, Abdelfattah, Mohamed, Zhang, Zhiru
Large language models (LLMs) often struggle with strict memory, latency, and power demands. To meet these demands, various forms of dynamic sparsity have been proposed that reduce compute on an input-by-input basis. These methods improve over static…
External link:
http://arxiv.org/abs/2404.04900
Author:
Akhauri, Yash, Abdelfattah, Mohamed S.
Predictor-based methods have substantially enhanced Neural Architecture Search (NAS) optimization. The efficacy of these predictors is largely influenced by the method of encoding neural network architectures. While traditional encodings used an adja…
External link:
http://arxiv.org/abs/2403.02484
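The "adjacency" encoding the snippet is cut off on is the traditional baseline: a NAS cell is serialized as its flattened adjacency matrix plus a one-hot operation list, and a learned predictor maps that vector to estimated accuracy. A minimal sketch of that baseline encoding (the operation vocabulary is a made-up example):

```python
import numpy as np

def encode_architecture(adj, ops, op_vocab):
    """Flatten a cell's adjacency matrix and one-hot op list into a vector.

    This is the classic adjacency-based encoding; an accuracy predictor
    would take this fixed-length vector as its input features.
    """
    onehot = np.zeros((len(ops), len(op_vocab)))
    for i, op in enumerate(ops):
        onehot[i, op_vocab.index(op)] = 1.0
    return np.concatenate([np.asarray(adj, float).ravel(), onehot.ravel()])

vocab = ["conv3x3", "conv1x1", "maxpool"]       # hypothetical op set
adj = [[0, 1, 1],   # node 0 feeds nodes 1 and 2
       [0, 0, 1],   # node 1 feeds node 2
       [0, 0, 0]]
vec = encode_architecture(adj, ["conv3x3", "maxpool", "conv1x1"], vocab)
assert vec.shape == (3 * 3 + 3 * 3,)   # 9 adjacency + 9 one-hot entries
```

The known weakness of this encoding, which motivates alternatives, is that it is tied to one fixed graph size and treats structurally similar cells as unrelated vectors.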
Author:
Akhauri, Yash, Abdelfattah, Mohamed S.
Efficient deployment of neural networks (NN) requires the co-optimization of accuracy and latency. For example, hardware-aware neural architecture search has been used to automatically find NN architectures that satisfy a latency constraint on a spec…
External link:
http://arxiv.org/abs/2403.02446
Deep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end DNN-based vision applicat…
External link:
http://arxiv.org/abs/2403.12981
Traditional methods, such as JPEG, perform image compression by operating on structural information, such as pixel values or frequency content. These methods are effective down to bitrates around one bit per pixel (bpp) and higher at standard image sizes.
External link:
http://arxiv.org/abs/2402.13536