Abstract |
---|
Sequence-model-based NLP applications can be large. Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HMF) that achieves this dual objective. HMF improves on low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structured-matrix-based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (Translation, Intent Detection, Language Modeling) and show that for similar accuracy values and compression factors, HMF can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF. |
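The abstract contrasts plain low-rank factorization (LMF) with a hybrid structure that keeps part of the weight matrix dense and factorizes only the remainder. A minimal NumPy sketch of that idea is below, assuming a row-wise split of the matrix and an equal-parameter-budget comparison; the split point, rank choices, and variable names are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 64   # original weight matrix size
r = 8           # rank budget for plain LMF

W = rng.standard_normal((m, n))

# Plain LMF: approximate W with a rank-r product via truncated SVD.
# Parameter count: r * (m + n).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_lmf = (U[:, :r] * s[:r]) @ Vt[:r, :]

# Hybrid sketch (assumed structure): keep the first k rows dense and
# factorize only the remaining (m - k) rows at rank r2, chosen so the
# total parameter count stays within the LMF budget:
#   k * n + r2 * ((m - k) + n) <= r * (m + n)
k = 4
r2 = (r * (m + n) - k * n) // ((m - k) + n)
U2, s2, Vt2 = np.linalg.svd(W[k:], full_matrices=False)
W_hmf = np.vstack([W[:k], (U2[:, :r2] * s2[:r2]) @ Vt2[:r2, :]])

# Both approximations stay dense at inference time; compare errors.
err_lmf = np.linalg.norm(W - W_lmf)
err_hmf = np.linalg.norm(W - W_hmf)
print(r2, W_hmf.shape, err_lmf, err_hmf)
```

Because both branches of the hybrid are ordinary dense matrix products, inference remains a sequence of dense GEMMs, which is the run-time advantage the abstract claims over pruning or structured-matrix approaches.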
Year | DOI | Venue |
---|---|---|
2020 | 10.18653/V1/2020.SUSTAINLP-1.2 | Conference on Empirical Methods in Natural Language Processing |
DocType | Citations | PageRank
---|---|---|
Conference | 0 | 0.34
References | Authors
---|---|
0 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Urmish Thakker | 1 | 1 | 3.74 |
Jesse G. Beu | 2 | 2 | 3.41 |
Dibakar Gope | 3 | 10 | 3.29 |
Ganesh S. Dasika | 4 | 387 | 24.30 |
Matthew Mattina | 5 | 441 | 28.63 |