Showing 1 - 10 of 106 results for the search: '"P Frantar"'
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also…
External link:
http://arxiv.org/abs/2408.11743
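The snippet above mentions weight quantization for GPU deployment; as a minimal, generic illustration (not this paper's method), here is round-to-nearest 4-bit group quantization of a weight matrix in numpy. The group size, bit width, and function names are assumptions made for this sketch.

```python
import numpy as np

def quantize_rtn_int4(W, group_size=128):
    """Round-to-nearest symmetric 4-bit quantization, per group of columns.

    Returns integer codes in [-8, 7] plus per-group scales, so that
    W is approximately codes * scales. Illustrative only; real kernels pack
    codes into bytes and fuse dequantization into the matmul.
    """
    rows, cols = W.shape
    assert cols % group_size == 0
    Wg = W.reshape(rows, cols // group_size, group_size)
    scales = np.abs(Wg).max(axis=-1, keepdims=True) / 7.0  # map max |w| to code 7
    codes = np.clip(np.round(Wg / scales), -8, 7).astype(np.int8)
    return codes.reshape(rows, cols), np.repeat(scales.squeeze(-1), group_size, axis=1)

def dequantize(codes, scales):
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)
codes, scales = quantize_rtn_int4(W)
W_hat = dequantize(codes, scales)
print("mean abs error:", np.abs(W - W_hat).mean())
```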
Author:
Egiazarian, Vage, Panferov, Andrei, Kuznedelev, Denis, Frantar, Elias, Babenko, Artem, Alistarh, Dan
The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined…
External link:
http://arxiv.org/abs/2401.06118
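This abstract cuts off before describing the method itself; purely as intuition for how budgets of only a few bits per parameter become possible, here is a toy codebook ("vector") quantization sketch, using plain k-means in numpy. The group length, codebook size, and all names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def vq_compress(W, dim=8, codebook_size=256, iters=20, seed=0):
    """Toy vector quantization: split W into length-`dim` groups and map each
    to the nearest of `codebook_size` learned centroids (plain k-means).
    With dim=8 and 256 centroids this stores 8 bits per 8 weights, i.e. about
    1 bit per parameter before counting the codebook itself."""
    rng = np.random.default_rng(seed)
    groups = W.reshape(-1, dim)
    codebook = groups[rng.choice(len(groups), codebook_size, replace=False)]
    for _ in range(iters):
        # Assign each group to its nearest centroid (squared Euclidean distance).
        d2 = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # Recompute each centroid as the mean of its assigned groups.
        for k in range(codebook_size):
            members = groups[assign == k]
            if len(members):
                codebook[k] = members.mean(0)
    return assign.astype(np.uint8), codebook

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 128)).astype(np.float32)
codes, codebook = vq_compress(W)
W_hat = codebook[codes].reshape(W.shape)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```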
Author:
Frantar, Elias, Alistarh, Dan
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the Switch…
External link:
http://arxiv.org/abs/2310.16795
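To make the sparse-routing idea concrete, here is a minimal, generic top-k Mixture-of-Experts forward pass in numpy; the gate and expert shapes, the choice of k, and the function names are illustrative assumptions, not the implementation referenced in the abstract.

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=1):
    """Sparse Mixture-of-Experts layer: a gate scores every expert per token,
    but only the top-k experts are evaluated, so compute per token stays
    roughly constant while total parameter count grows with the expert count."""
    logits = x @ gate_W                         # (tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    # Softmax over the selected logits only.
    sel = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(sel - sel.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for i, (ids, ws) in enumerate(zip(topk, weights)):
        for e, w in zip(ids, ws):
            out[i] += w * experts[e](x[i])
    return out

rng = np.random.default_rng(0)
d, num_experts, tokens = 16, 8, 4
# Each "expert" is just a small linear map here.
expert_mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(num_experts)]
experts = [lambda v, M=M: v @ M for M in expert_mats]
gate_W = rng.standard_normal((d, num_experts)) / np.sqrt(d)
x = rng.standard_normal((tokens, d))
print(moe_forward(x, gate_W, experts, k=2).shape)  # (4, 16)
```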
Author:
Ashkboos, Saleh, Markov, Ilia, Frantar, Elias, Zhong, Tingxuan, Wang, Xincheng, Ren, Jie, Hoefler, Torsten, Alistarh, Dan
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization…
External link:
http://arxiv.org/abs/2310.09259
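Since the abstract contrasts weight-only quantization with quantizing more of the inference pipeline, here is a hedged numpy sketch of the two regimes, 4-bit weights with floating-point activations versus 4-bit weights and activations; the scaling scheme and bit width are generic choices, not this paper's method.

```python
import numpy as np

def quant_sym(x, bits=4, axis=-1):
    """Symmetric round-to-nearest quantization with one scale per row/column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -qmax - 1, qmax), scale

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64)).astype(np.float32)   # activations
W = rng.standard_normal((64, 64)).astype(np.float32)  # weights
ref = X @ W

# Weight-only quantization: weights in int4, activations left in floating point.
Wq, sw = quant_sym(W, bits=4, axis=0)
weight_only = X @ (Wq * sw)

# Weight + activation quantization: both sides quantized before the matmul.
Xq, sx = quant_sym(X, bits=4, axis=1)
both = (Xq @ Wq) * (sx * sw)   # integer-valued matmul, then rescale

for name, out in [("weight-only", weight_only), ("W+A int4", both)]:
    print(name, "rel. error:", np.linalg.norm(out - ref) / np.linalg.norm(ref))
```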
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks, while inducing sparsity in their weights. On the accuracy side, we observe that standard loss-based fine-tuning…
External link:
http://arxiv.org/abs/2310.06927
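A minimal sketch of the setting described above, fine-tuning while keeping a fixed sparsity mask on the weights, follows; magnitude masking and the plain SGD-style update are generic stand-ins, not the paper's recipe.

```python
import numpy as np

def magnitude_mask(W, sparsity=0.5):
    """Keep the largest-magnitude entries; `sparsity` is the fraction pruned."""
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    return (np.abs(W) >= thresh).astype(W.dtype)

def sparse_finetune_step(W, grad, mask, lr=1e-2):
    """One fine-tuning step that keeps the sparsity pattern fixed:
    pruned weights receive no update and stay exactly zero."""
    return (W - lr * grad) * mask

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
mask = magnitude_mask(W, sparsity=0.75)   # prune 75% of the weights
W = W * mask                              # apply the mask once up front
grad = rng.standard_normal(W.shape).astype(np.float32)  # stand-in gradient
W = sparse_finetune_step(W, grad, mask)
print("fraction of zeros:", float((W == 0).mean()))
```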
It is known that sparsity can improve interpretability for deep neural networks. However, existing methods in the area either require networks that are pre-trained with sparsity constraints, or impose sparsity after the fact, altering the network's…
External link:
http://arxiv.org/abs/2310.04519
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship…
External link:
http://arxiv.org/abs/2309.08520
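"Scaling law" here refers to a fitted parametric form relating loss to quantities such as model size and, in this paper, sparsity. As a generic illustration of what such a fit involves (and emphatically not the law identified in the paper), here is a toy saturating power-law fit on synthetic numbers using scipy; the functional form and constants are my assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Generic saturating power law of the kind used in scaling-law fits:
# loss(N) = a * N**(-b) + c, with N the parameter count.
def power_law(N, a, b, c):
    return a * N ** (-b) + c

rng = np.random.default_rng(0)
N = np.logspace(6, 9, 12)                        # 1M .. 1B parameters
true = power_law(N, a=4.0e2, b=0.28, c=1.9)      # made-up "ground truth"
loss = true * (1 + 0.01 * rng.standard_normal(N.size))  # add a little noise

(a, b, c), _ = curve_fit(power_law, N, loss, p0=(100.0, 0.3, 1.0), maxfev=10000)
print(f"fitted: a={a:.1f}, b={b:.3f}, irreducible loss c={c:.2f}")
```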
Author:
Kuznedelev, Denis, Kurtic, Eldar, Iofinova, Eugenia, Frantar, Elias, Peste, Alexandra, Alistarh, Dan
Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community. Yet…
External link:
http://arxiv.org/abs/2308.02060
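As one concrete example of the kind of pruning recipe compared in this line of work, here is a gradual magnitude-pruning schedule in numpy; the cubic ramp, thresholds, and names are standard illustrative choices rather than this paper's method.

```python
import numpy as np

def cubic_sparsity_schedule(step, total_steps, final_sparsity=0.9, start=0.0):
    """Gradual pruning schedule (cubic ramp): sparsity rises smoothly from
    `start` to `final_sparsity` over training, a common recipe for reaching
    high sparsity without a large accuracy drop."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (start - final_sparsity) * (1.0 - t) ** 3

def prune_to_sparsity(W, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    k = int(W.size * sparsity)
    if k == 0:
        return W
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= thresh, 0.0, W)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32)).astype(np.float32)
for step in range(0, 1001, 250):
    s = cubic_sparsity_schedule(step, total_steps=1000)
    Wp = prune_to_sparsity(W, s)
    print(f"step {step:4d}: target {s:.2f}, actual {float((Wp == 0).mean()):.2f}")
```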
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including…
External link:
http://arxiv.org/abs/2307.03738
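The abstract concerns generating CPU kernels for quantized inference; as an illustration of the inner loop such a kernel has to implement (grouped int4 weights, dequantized on the fly inside a matrix-vector product), here is a numpy sketch. It is not the generated code from the paper, and the group size and layout are assumptions.

```python
import numpy as np

def quantized_matvec(codes, scales, x, group_size=64):
    """y = W @ x where W is stored as int4 codes plus one fp scale per group
    of `group_size` input elements. Dequantization happens on the fly, group
    by group; a generated CPU kernel would tile and vectorize this structure."""
    rows, cols = codes.shape
    y = np.zeros(rows, dtype=np.float32)
    for g in range(cols // group_size):
        lo, hi = g * group_size, (g + 1) * group_size
        # scales[:, g] dequantizes this group's codes; accumulate the partial dot.
        y += scales[:, g] * (codes[:, lo:hi].astype(np.float32) @ x[lo:hi])
    return y

rng = np.random.default_rng(0)
rows, cols, G = 128, 256, 64
W = rng.standard_normal((rows, cols)).astype(np.float32)
# Per-group symmetric int4 quantization of the weights.
scales = np.abs(W.reshape(rows, cols // G, G)).max(-1) / 7.0
codes = np.clip(np.round(W / np.repeat(scales, G, axis=1)), -8, 7).astype(np.int8)
x = rng.standard_normal(cols).astype(np.float32)
ref = (codes.astype(np.float32) * np.repeat(scales, G, axis=1)) @ x
print("max abs diff vs. reference:", np.abs(quantized_matvec(codes, scales, x, G) - ref).max())
```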
Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning. Yet, existing approaches for accurate full-matrix preconditioning…
External link:
http://arxiv.org/abs/2306.06098
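As a generic illustration of why full-matrix preconditioning is attractive, and why its d x d memory cost motivates compression schemes of the kind this abstract hints at, here is a small numpy comparison of plain gradient descent against an exact inverse-Hessian step on an ill-conditioned quadratic. The exact Hessian inverse is a stand-in used only for the demo, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
# Ill-conditioned quadratic f(w) = 0.5 * w^T A w, so grad f = A w and the Hessian is A.
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
A = Q @ np.diag(np.logspace(0, 3, d)) @ Q.T
w0 = rng.standard_normal(d)

def f(w):
    return 0.5 * w @ A @ w

# Plain gradient descent: the step size is limited by the largest curvature.
w_gd = w0.copy()
lr = 1.0 / np.linalg.eigvalsh(A).max()
for _ in range(100):
    w_gd = w_gd - lr * (A @ w_gd)

# Full-matrix preconditioning with the exact inverse Hessian: one step suffices
# on a quadratic, but it needs the full d x d matrix, which is the object that
# compression / error-feedback methods aim to approximate within a memory budget.
P = np.linalg.inv(A)
w_pre = w0 - P @ (A @ w0)

print(f"initial loss {f(w0):.3e}, GD after 100 steps {f(w_gd):.3e}, "
      f"preconditioned one step {f(w_pre):.3e}")
```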