A 400MHz NPU with 7.8TOPS²/W High-Performance-Guaranteed Efficiency in 55nm for Multi-Mode Pruning and Diverse Quantization Using Pattern-Kernel Encoding and Reconfigurable MAC Units

Authors: Zhanhong Tan, Sia-Huat Tan, Yannian Zhang, Yifu Wu, Kaisheng Ma, Jan-Henrik Lambrechts
Year of publication: 2021
Subject:
Source: CICC
DOI: 10.1109/cicc51472.2021.9431519
Description: Deep neural networks show promise in applications ranging from face ID on mobile phones to self-driving cars. Weight pruning and quantization are valuable techniques for relieving the burden of computation and memory. Figure 1 shows the family of weight pruning methods, including fine-grained pruning and several structural pruning methods. At similar compression rates, coarse-grained pruning causes a larger accuracy drop. A newer structural approach called pattern pruning [5] achieves excellent precision with uniform sparsity rates among kernels, which makes it hardware-friendly. Kernels are encoded as their non-zero values together with sparse pattern masks (SPM). This work adopts 16 pattern types with a 4b SPM for the 3x3 convolution, which yields up to 8x compression for eight-zero kernels. As for quantization, the optimal choice generally depends on the model.
Database: OpenAIRE
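
The pattern-kernel encoding mentioned in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not list the 16 patterns, so the table `PATTERNS` below is purely hypothetical (the centre weight plus three neighbours, i.e. five zeros per kernel), and the magnitude-based pattern selection in `best_pattern` is an assumed heuristic. The names `encode`, `decode`, and `compressed_bits` are also introduced here only for illustration.

```python
from itertools import combinations, islice
from typing import List, Sequence, Tuple

# Hypothetical pattern table (NOT taken from the paper): each pattern lists the
# flat positions (0..8, row-major) of a 3x3 kernel that are kept. Every pattern
# keeps the centre (position 4) plus three of the eight neighbours, so all
# kernels share the same sparsity. Only the first 16 combinations are used,
# which lets a pattern be identified by a 4-bit index (the SPM).
NEIGHBOURS = [0, 1, 2, 3, 5, 6, 7, 8]
PATTERNS: List[Tuple[int, ...]] = [
    tuple(sorted(combo + (4,)))
    for combo in islice(combinations(NEIGHBOURS, 3), 16)
]


def best_pattern(kernel: Sequence[float]) -> int:
    """Assumed heuristic: pick the pattern that preserves the most magnitude."""
    return max(
        range(len(PATTERNS)),
        key=lambda pid: sum(abs(kernel[pos]) for pos in PATTERNS[pid]),
    )


def encode(kernel: Sequence[float]) -> Tuple[int, List[float]]:
    """Return (4-bit pattern index, kept weights in position order)."""
    pid = best_pattern(kernel)
    return pid, [kernel[pos] for pos in PATTERNS[pid]]


def decode(pid: int, values: Sequence[float]) -> List[float]:
    """Rebuild the dense 3x3 kernel; pruned positions become zero."""
    dense = [0.0] * 9
    for pos, value in zip(PATTERNS[pid], values):
        dense[pos] = value
    return dense


def compressed_bits(num_values: int, weight_bits: int, spm_bits: int = 4) -> int:
    """Storage for one encoded kernel: the SPM index plus the kept weights."""
    return spm_bits + num_values * weight_bits
```

Under these assumptions, a kernel whose non-zero weights already match one of the patterns survives an `encode`/`decode` round trip unchanged. The achievable compression ratio depends on the weight bit width and on how many zeros each pattern contains, so this sketch does not attempt to reproduce the 8x figure quoted in the abstract.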