Showing 1 - 10 of 106 results for the search: '"P Frantar"'
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also…
External link:
http://arxiv.org/abs/2408.11743
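The snippet above mentions weight quantization for GPU deployment; as a minimal, generic illustration (not this paper's method), here is round-to-nearest 4-bit group quantization of a weight matrix in numpy. The group size, bit width, and function names are assumptions made for this sketch.

```python
import numpy as np

def quantize_rtn_int4(W, group_size=128):
    """Round-to-nearest symmetric 4-bit quantization, per group of columns.

    Returns integer codes in [-8, 7] plus per-group scales, so that
    W is approximately codes * scales. Illustrative only; real kernels pack
    codes into bytes and fuse dequantization into the matmul.
    """
    rows, cols = W.shape
    assert cols % group_size == 0
    Wg = W.reshape(rows, cols // group_size, group_size)
    scales = np.abs(Wg).max(axis=-1, keepdims=True) / 7.0  # map max |w| to code 7
    codes = np.clip(np.round(Wg / scales), -8, 7).astype(np.int8)
    return codes.reshape(rows, cols), np.repeat(scales.squeeze(-1), group_size, axis=1)

def dequantize(codes, scales):
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)
codes, scales = quantize_rtn_int4(W)
W_hat = dequantize(codes, scales)
print("mean abs error:", np.abs(W - W_hat).mean())
```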
Author:
Egiazarian, Vage, Panferov, Andrei, Kuznedelev, Denis, Frantar, Elias, Babenko, Artem, Alistarh, Dan
The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined…
External link:
http://arxiv.org/abs/2401.06118
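This abstract cuts off before describing the method itself; purely as intuition for how budgets of only a few bits per parameter become possible, here is a toy codebook ("vector") quantization sketch, using plain k-means in numpy. The group length, codebook size, and all names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def vq_compress(W, dim=8, codebook_size=256, iters=20, seed=0):
    """Toy vector quantization: split W into length-`dim` groups and map each
    to the nearest of `codebook_size` learned centroids (plain k-means).
    With dim=8 and 256 centroids this stores 8 bits per 8 weights, i.e. about
    1 bit per parameter before counting the codebook itself."""
    rng = np.random.default_rng(seed)
    groups = W.reshape(-1, dim)
    codebook = groups[rng.choice(len(groups), codebook_size, replace=False)]
    for _ in range(iters):
        # Assign each group to its nearest centroid (squared Euclidean distance).
        d2 = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # Recompute each centroid as the mean of its assigned groups.
        for k in range(codebook_size):
            members = groups[assign == k]
            if len(members):
                codebook[k] = members.mean(0)
    return assign.astype(np.uint8), codebook

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 128)).astype(np.float32)
codes, codebook = vq_compress(W)
W_hat = codebook[codes].reshape(W.shape)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```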
Author:
Frantar, Elias, Alistarh, Dan
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the Switch…
External link:
http://arxiv.org/abs/2310.16795
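To make the sparse-routing idea concrete, here is a minimal, generic top-k Mixture-of-Experts forward pass in numpy; the gate and expert shapes, the choice of k, and the function names are illustrative assumptions, not the implementation referenced in the abstract.

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=1):
    """Sparse Mixture-of-Experts layer: a gate scores every expert per token,
    but only the top-k experts are evaluated, so compute per token stays
    roughly constant while total parameter count grows with the expert count."""
    logits = x @ gate_W                         # (tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    # Softmax over the selected logits only.
    sel = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(sel - sel.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for i, (ids, ws) in enumerate(zip(topk, weights)):
        for e, w in zip(ids, ws):
            out[i] += w * experts[e](x[i])
    return out

rng = np.random.default_rng(0)
d, num_experts, tokens = 16, 8, 4
# Each "expert" is just a small linear map here.
expert_mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(num_experts)]
experts = [lambda v, M=M: v @ M for M in expert_mats]
gate_W = rng.standard_normal((d, num_experts)) / np.sqrt(d)
x = rng.standard_normal((tokens, d))
print(moe_forward(x, gate_W, experts, k=2).shape)  # (4, 16)
```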
Author:
Ashkboos, Saleh, Markov, Ilia, Frantar, Elias, Zhong, Tingxuan, Wang, Xincheng, Ren, Jie, Hoefler, Torsten, Alistarh, Dan
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization…
External link:
http://arxiv.org/abs/2310.09259
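Since the abstract contrasts weight-only quantization with quantizing more of the inference pipeline, here is a hedged numpy sketch of the two regimes, 4-bit weights with floating-point activations versus 4-bit weights and activations; the scaling scheme and bit width are generic choices, not this paper's method.

```python
import numpy as np

def quant_sym(x, bits=4, axis=-1):
    """Symmetric round-to-nearest quantization with one scale per row/column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -qmax - 1, qmax), scale

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64)).astype(np.float32)   # activations
W = rng.standard_normal((64, 64)).astype(np.float32)  # weights
ref = X @ W

# Weight-only quantization: weights in int4, activations left in floating point.
Wq, sw = quant_sym(W, bits=4, axis=0)
weight_only = X @ (Wq * sw)

# Weight + activation quantization: both sides quantized before the matmul.
Xq, sx = quant_sym(X, bits=4, axis=1)
both = (Xq @ Wq) * (sx * sw)   # integer-valued matmul, then rescale

for name, out in [("weight-only", weight_only), ("W+A int4", both)]:
    print(name, "rel. error:", np.linalg.norm(out - ref) / np.linalg.norm(ref))
```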
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks, while inducing sparsity in their weights. On the accuracy side, we observe that standard loss-based fine-tuning…
External link:
http://arxiv.org/abs/2310.06927
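A minimal sketch of the setting described above, fine-tuning while keeping a fixed sparsity mask on the weights, follows; magnitude masking and the plain SGD-style update are generic stand-ins, not the paper's recipe.

```python
import numpy as np

def magnitude_mask(W, sparsity=0.5):
    """Keep the largest-magnitude entries; `sparsity` is the fraction pruned."""
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    return (np.abs(W) >= thresh).astype(W.dtype)

def sparse_finetune_step(W, grad, mask, lr=1e-2):
    """One fine-tuning step that keeps the sparsity pattern fixed:
    pruned weights receive no update and stay exactly zero."""
    return (W - lr * grad) * mask

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
mask = magnitude_mask(W, sparsity=0.75)   # prune 75% of the weights
W = W * mask                              # apply the mask once up front
grad = rng.standard_normal(W.shape).astype(np.float32)  # stand-in gradient
W = sparse_finetune_step(W, grad, mask)
print("fraction of zeros:", float((W == 0).mean()))
```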
It is known that sparsity can improve interpretability for deep neural networks. However, existing methods in the area either require networks that are pre-trained with sparsity constraints, or impose sparsity after the fact, altering the network's…
External link:
http://arxiv.org/abs/2310.04519
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship…
External link:
http://arxiv.org/abs/2309.08520
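"Scaling law" here refers to a fitted parametric form relating loss to quantities such as model size and, in this paper, sparsity. As a generic illustration of what such a fit involves (and emphatically not the law identified in the paper), here is a toy saturating power-law fit on synthetic numbers using scipy; the functional form and constants are my assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Generic saturating power law of the kind used in scaling-law fits:
# loss(N) = a * N**(-b) + c, with N the parameter count.
def power_law(N, a, b, c):
    return a * N ** (-b) + c

rng = np.random.default_rng(0)
N = np.logspace(6, 9, 12)                        # 1M .. 1B parameters
true = power_law(N, a=4.0e2, b=0.28, c=1.9)      # made-up "ground truth"
loss = true * (1 + 0.01 * rng.standard_normal(N.size))  # add a little noise

(a, b, c), _ = curve_fit(power_law, N, loss, p0=(100.0, 0.3, 1.0), maxfev=10000)
print(f"fitted: a={a:.1f}, b={b:.3f}, irreducible loss c={c:.2f}")
```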
Author:
Kuznedelev, Denis, Kurtic, Eldar, Iofinova, Eugenia, Frantar, Elias, Peste, Alexandra, Alistarh, Dan
Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community. Yet…
External link:
http://arxiv.org/abs/2308.02060
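As one concrete example of the kind of pruning recipe compared in this line of work, here is a gradual magnitude-pruning schedule in numpy; the cubic ramp, thresholds, and names are standard illustrative choices rather than this paper's method.

```python
import numpy as np

def cubic_sparsity_schedule(step, total_steps, final_sparsity=0.9, start=0.0):
    """Gradual pruning schedule (cubic ramp): sparsity rises smoothly from
    `start` to `final_sparsity` over training, a common recipe for reaching
    high sparsity without a large accuracy drop."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (start - final_sparsity) * (1.0 - t) ** 3

def prune_to_sparsity(W, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    k = int(W.size * sparsity)
    if k == 0:
        return W
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= thresh, 0.0, W)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32)).astype(np.float32)
for step in range(0, 1001, 250):
    s = cubic_sparsity_schedule(step, total_steps=1000)
    Wp = prune_to_sparsity(W, s)
    print(f"step {step:4d}: target {s:.2f}, actual {float((Wp == 0).mean()):.2f}")
```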
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including…
External link:
http://arxiv.org/abs/2307.03738
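The abstract concerns generating CPU kernels for quantized inference; as an illustration of the inner loop such a kernel has to implement (grouped int4 weights, dequantized on the fly inside a matrix-vector product), here is a numpy sketch. It is not the generated code from the paper, and the group size and layout are assumptions.

```python
import numpy as np

def quantized_matvec(codes, scales, x, group_size=64):
    """y = W @ x where W is stored as int4 codes plus one fp scale per group
    of `group_size` input elements. Dequantization happens on the fly, group
    by group; a generated CPU kernel would tile and vectorize this structure."""
    rows, cols = codes.shape
    y = np.zeros(rows, dtype=np.float32)
    for g in range(cols // group_size):
        lo, hi = g * group_size, (g + 1) * group_size
        # scales[:, g] dequantizes this group's codes; accumulate the partial dot.
        y += scales[:, g] * (codes[:, lo:hi].astype(np.float32) @ x[lo:hi])
    return y

rng = np.random.default_rng(0)
rows, cols, G = 128, 256, 64
W = rng.standard_normal((rows, cols)).astype(np.float32)
# Per-group symmetric int4 quantization of the weights.
scales = np.abs(W.reshape(rows, cols // G, G)).max(-1) / 7.0
codes = np.clip(np.round(W / np.repeat(scales, G, axis=1)), -8, 7).astype(np.int8)
x = rng.standard_normal(cols).astype(np.float32)
ref = (codes.astype(np.float32) * np.repeat(scales, G, axis=1)) @ x
print("max abs diff vs. reference:", np.abs(quantized_matvec(codes, scales, x, G) - ref).max())
```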
Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning. Yet, existing approaches for accurate full-matrix preconditioning…
External link:
http://arxiv.org/abs/2306.06098
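As a generic illustration of why full-matrix preconditioning is attractive, and why its d x d memory cost motivates compression schemes of the kind this abstract hints at, here is a small numpy comparison of plain gradient descent against an exact inverse-Hessian step on an ill-conditioned quadratic. The exact Hessian inverse is a stand-in used only for the demo, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
# Ill-conditioned quadratic f(w) = 0.5 * w^T A w, so grad f = A w and the Hessian is A.
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
A = Q @ np.diag(np.logspace(0, 3, d)) @ Q.T
w0 = rng.standard_normal(d)

def f(w):
    return 0.5 * w @ A @ w

# Plain gradient descent: the step size is limited by the largest curvature.
w_gd = w0.copy()
lr = 1.0 / np.linalg.eigvalsh(A).max()
for _ in range(100):
    w_gd = w_gd - lr * (A @ w_gd)

# Full-matrix preconditioning with the exact inverse Hessian: one step suffices
# on a quadratic, but it needs the full d x d matrix, which is the object that
# compression / error-feedback methods aim to approximate within a memory budget.
P = np.linalg.inv(A)
w_pre = w0 - P @ (A @ w0)

print(f"initial loss {f(w0):.3e}, GD after 100 steps {f(w_gd):.3e}, "
      f"preconditioned one step {f(w_pre):.3e}")
```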