Showing 1 - 10 of 17 for search: '"Shen, Haihao"'
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which requires a dem…
External link:
http://arxiv.org/abs/2311.00502
As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. In this paper, we…
External link:
http://arxiv.org/abs/2310.10944
Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we s…
External link:
http://arxiv.org/abs/2309.14592
Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks, but their deployment poses significant challenges due to substantial memory and storage requirements. Weight-only quantization has emerged as a promisin…
External link:
http://arxiv.org/abs/2309.05516
Author:
Shen, Haihao, Meng, Hengyu, Dong, Bo, Wang, Zhe, Zafrir, Ofir, Ding, Yi, Luo, Yu, Chang, Hanwen, Gao, Qun, Wang, Ziheng, Boudoukh, Guy, Wasserblat, Moshe
In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the…
External link:
http://arxiv.org/abs/2306.16601
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller trans…
External link:
http://arxiv.org/abs/2210.17114
Author:
Shen, Haihao, Zafrir, Ofir, Dong, Bo, Meng, Hengyu, Ye, Xinyu, Wang, Zhe, Ding, Yi, Chang, Hanwen, Boudoukh, Guy, Wasserblat, Moshe
Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires the maximum throughput to comply with certain latency constraints that prevent Transformer…
External link:
http://arxiv.org/abs/2211.07715
Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the imple…
External link:
http://arxiv.org/abs/2111.05754
Author:
Gong, Jiong, Shen, Haihao, Zhang, Guoming, Liu, Xiaoli, Li, Shane, Jin, Ge, Maheshwari, Niharika, Fomenko, Evarist, Segal, Eden
High throughput and low latency inference of deep neural networks are critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel optimized deep learning framework t…
External link:
http://arxiv.org/abs/1805.08691
Academic article
This result cannot be displayed to users who are not signed in.
You must sign in to view this result.