Showing 1 - 10 of 202 results for the search: '"Abdelfattah, Mohamed S."'
Bit-level sparsity methods skip ineffectual zero-bit operations and are typically applicable within bit-serial deep learning accelerators. This type of sparsity at the bit-level is especially interesting because it is both orthogonal and compatible with …
External link:
http://arxiv.org/abs/2409.05227
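The entry above describes skipping ineffectual (zero) weight bits in a bit-serial accelerator. A minimal sketch of that idea, assuming a simple shift-and-add bit-serial multiplier (the function name and 8-bit width are illustrative, not from the paper):

```python
def bitserial_mul(a: int, w: int, width: int = 8):
    """Multiply a * w one weight bit per cycle, skipping zero bits.

    In a bit-serial accelerator each processed bit costs a cycle, so
    skipping the ineffectual zero bits makes the cycle count equal to
    the number of set bits in w rather than the full bit-width.
    """
    acc, cycles = 0, 0
    for i in range(width):
        if (w >> i) & 1:       # zero bits are skipped entirely
            acc += a << i      # shift-and-add for each effectual bit
            cycles += 1
    return acc, cycles

product, cycles = bitserial_mul(3, 0b00010010)  # w = 18 has only 2 set bits
# product == 54 (= 3 * 18), computed in 2 cycles instead of 8
```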
FPGAs offer a flexible platform for accelerating deep neural network (DNN) inference, particularly for non-uniform workloads featuring fine-grained unstructured sparsity and mixed arithmetic precision. To leverage these redundancies, an emerging approach …
External link:
http://arxiv.org/abs/2407.06033
Author:
Akhauri, Yash, AbouElhamayed, Ahmed F, Dotzel, Jordan, Zhang, Zhiru, Rush, Alexander M, Huda, Safeen, Abdelfattah, Mohamed S
The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated efficiency techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs …
External link:
http://arxiv.org/abs/2406.16635
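The contextual-sparsity idea above can be illustrated with a toy ReLU feed-forward layer: which neurons are active depends on the input, so the inactive rows can be skipped per token. A hedged NumPy sketch (all names invented here; not the paper's system):

```python
import numpy as np

def ffn_contextual_sparse(x, W_up, W_down):
    """ReLU feed-forward layer that skips input-dependent inactive neurons.

    Neurons with non-positive pre-activation output exactly zero after
    ReLU, so their rows of W_down contribute nothing and can be skipped.
    The active set changes with every input x: contextual sparsity.
    """
    pre = x @ W_up                     # pre-activations, shape (hidden,)
    active = pre > 0                   # which neurons fire for THIS input
    h = pre[active]                    # keep only the active activations
    return h @ W_down[active], active  # skip inactive rows of W_down

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W_up = rng.standard_normal((16, 64))
W_down = rng.standard_normal((64, 16))
y, active = ffn_contextual_sparse(x, W_up, W_down)
dense = np.maximum(x @ W_up, 0.0) @ W_down
assert np.allclose(y, dense)           # identical output, fewer multiplies
```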
Autor:
Dotzel, Jordan, Chen, Yuzong, Kotb, Bahaa, Prasad, Sushma, Wu, Gang, Li, Sheng, Abdelfattah, Mohamed S., Zhang, Zhiru
The increasing size of large language models (LLMs) traditionally requires low-precision integer formats to meet strict latency and power demands. Yet recently, alternative formats such as Normal Float (NF4) have increased model accuracy at the cost of …
External link:
http://arxiv.org/abs/2405.03103
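The entry contrasts integer formats with Normal Float (NF4). NF4 is a 4-bit codebook whose 16 levels are standard-normal quantiles; a sketch of absmax codebook quantization (level values reproduced, rounded, from the published QLoRA NF4 table; function names are mine, not from this paper):

```python
import numpy as np

# The 16 NF4 levels: standard-normal quantiles scaled to [-1, 1]
# (rounded to 4 decimals here).
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
                -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
                0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(block):
    """Quantize a weight block to 4-bit NF4 codes plus one absmax scale."""
    scale = np.abs(block).max()
    codes = np.abs(block / scale - NF4[:, None]).argmin(axis=0)
    return codes.astype(np.uint8), scale

def nf4_dequantize(codes, scale):
    """Map 4-bit codes back to real values."""
    return NF4[codes] * scale

codes, scale = nf4_quantize(np.array([2.0, -2.0, 0.0, 0.5]))
# the extreme values and zero are represented exactly; 0.5 snaps to a level
```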
Author:
Akhauri, Yash, Abdelfattah, Mohamed S.
Predictor-based methods have substantially enhanced Neural Architecture Search (NAS) optimization. The efficacy of these predictors is largely influenced by the method of encoding neural network architectures. While traditional encodings used an adjacency …
External link:
http://arxiv.org/abs/2403.02484
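The traditional predictor encoding alluded to above flattens a cell's adjacency matrix alongside one-hot operation labels into a fixed-length vector. A toy sketch (the 5-node cell and op names are invented for illustration):

```python
import numpy as np

# A NAS cell as a DAG: adjacency matrix (which node feeds which)
# plus an operation label per node.
ops = ["input", "conv3x3", "conv1x1", "maxpool", "output"]
adj = np.array([[0, 1, 1, 0, 0],
                [0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1],
                [0, 0, 0, 0, 1],
                [0, 0, 0, 0, 0]])

# One-hot encode the ops and concatenate with the flattened adjacency:
# this fixed-length vector is what an accuracy predictor consumes.
op_vocab = {op: i for i, op in enumerate(ops)}
one_hot = np.eye(len(op_vocab))[[op_vocab[o] for o in ops]]
encoding = np.concatenate([adj.ravel(), one_hot.ravel()])
# encoding.shape == (50,): 25 adjacency entries + 5 nodes x 5 op classes
```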
Author:
Akhauri, Yash, Abdelfattah, Mohamed S.
Efficient deployment of neural networks (NN) requires the co-optimization of accuracy and latency. For example, hardware-aware neural architecture search has been used to automatically find NN architectures that satisfy a latency constraint on a specific …
External link:
http://arxiv.org/abs/2403.02446
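Hardware-aware NAS as described above couples an accuracy objective with a device latency constraint; one simple version uses a per-operator latency lookup table. A toy sketch (all operator names, latencies, and accuracies are hypothetical):

```python
# Per-operator latencies measured on one target device (hypothetical ms).
latency_lut = {"conv3x3": 1.8, "conv1x1": 0.6, "maxpool": 0.3}

# Candidate architectures with their (predicted) accuracies.
candidates = [
    (["conv3x3", "conv3x3"], 0.94),   # accurate but slow
    (["conv1x1", "maxpool"], 0.90),   # faster, slightly less accurate
]

# Keep only candidates whose summed op latency meets the budget,
# then pick the most accurate one among them.
budget_ms = 2.0
feasible = [(arch, acc) for arch, acc in candidates
            if sum(latency_lut[op] for op in arch) <= budget_ms]
best_arch, best_acc = max(feasible, key=lambda t: t[1])
# only the second candidate fits the 2 ms budget, so it is selected
```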
Deep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end DNN-based vision application …
External link:
http://arxiv.org/abs/2403.12981
Author:
Hunter, Rosco, Dudziak, Łukasz, Abdelfattah, Mohamed S., Mehrotra, Abhinav, Bhattacharya, Sourav, Wen, Hongkai
Text-to-image diffusion models have demonstrated unprecedented capabilities for flexible and realistic image synthesis. Nevertheless, these models rely on a time-consuming sampling procedure, which has motivated attempts to reduce their latency. When …
External link:
http://arxiv.org/abs/2401.01008
Mixed-precision quantization is a popular approach for compressing deep neural networks (DNNs). However, it is challenging to scale the performance efficiently with mixed-precision DNNs given the current FPGA architecture and conventional accelerator …
External link:
http://arxiv.org/abs/2311.02758
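Mixed-precision quantization as in the entry above assigns each layer its own bit-width. A minimal sketch using symmetric uniform quantization (layer names and bit assignments are illustrative, not from the paper):

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of w to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

# Mixed precision: sensitive layers keep more bits, robust ones fewer.
rng = np.random.default_rng(0)
layer_bits = {"first_conv": 8, "middle_block": 4, "classifier": 8}
weights = {name: rng.standard_normal(256) for name in layer_bits}
quantized = {name: quantize_uniform(weights[name], bits)
             for name, bits in layer_bits.items()}
# 4-bit layers use at most 15 distinct values, 8-bit layers up to 255
```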
Author:
Dotzel, Jordan, Wu, Gang, Li, Andrew, Umar, Muhammad, Ni, Yun, Abdelfattah, Mohamed S., Zhang, Zhiru, Cheng, Liqun, Dixon, Martin G., Jouppi, Norman P., Le, Quoc V., Li, Sheng
Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With improved numerical support in recent hardware, including multiple …
External link:
http://arxiv.org/abs/2308.03290