Rank and run-time aware compression of NLP Applications

Author: Dibakar Gope, Urmish Thakker, Ganesh Dasika, Matthew Mattina, Jesse Beu
Year: 2020
DOI: 10.48550/arxiv.2010.03193
Description: Sequence-model-based NLP applications can be large. Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities while still facing run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HMF) that achieves this dual objective. HMF improves on low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix through an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structured-matrix-based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (Translation, Intent Detection, Language Modeling) and show that, for similar accuracy values and compression factors, HMF can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
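The description gives only a high-level picture of the hybrid dense-plus-low-rank idea. As a rough illustration (not the paper's exact scheme), the following minimal NumPy sketch keeps some rows of a weight matrix fully dense and factorizes the rest; the function names, split point, and rank are illustrative assumptions:

```python
import numpy as np

def hybrid_matrix_factorization(W, split, rank):
    """Approximate W by keeping the first `split` rows dense and
    replacing the remaining rows with a rank-`rank` factorization.
    The dense block retains full rank for part of the matrix, which
    is how a hybrid structure can carry a higher effective rank than
    plain LMF at a comparable parameter count. (Illustrative sketch;
    the paper's actual split/rank selection may differ.)"""
    dense_block = W[:split, :]  # kept uncompressed
    U, s, Vt = np.linalg.svd(W[split:, :], full_matrices=False)
    U_r = U[:, :rank] * s[:rank]  # fold singular values into U
    V_r = Vt[:rank, :]
    return dense_block, U_r, V_r

def reconstruct(dense_block, U_r, V_r):
    # Inference needs only dense GEMMs (the dense block plus U_r @ V_r),
    # avoiding the irregular sparse kernels that pruning would require.
    return np.vstack([dense_block, U_r @ V_r])

# Toy usage: compress a 256x256 matrix and check reconstruction error.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
d, U_r, V_r = hybrid_matrix_factorization(W, split=64, rank=32)
W_hat = reconstruct(d, U_r, V_r)
print(W_hat.shape, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Because both the dense block and the two low-rank factors are ordinary dense matrices, the compressed layer maps directly onto standard GEMM kernels, which is the run-time advantage the abstract claims over pruning.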
Comment: Published at SustaiNLP@EMNLP 2020. arXiv admin note: text overlap with arXiv:1906.04886
Database: OpenAIRE