Enhancing the Utilization of Processing Elements in Spatial Deep Neural Network Accelerators
Author: Seok-Bum Ko, Mohammadreza Asadikouhanjani
Year of publication: 2021
Subject: logistics & transportation, industrial biotechnology, speedup, dataflow, least slack time scheduling, computer science, deep learning, reference design, engineering and technology, Computer Graphics and Computer-Aided Design, industrial engineering & automation, network on a chip, computer engineering, bandwidth (computing), overhead (computing), artificial intelligence, electrical and electronic engineering, software
Source: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 40:1947-1951
ISSN: 1937-4151, 0278-0070
DOI: 10.1109/tcad.2020.3031240
Description: Equipping mobile platforms with deep learning applications is highly valuable: such platforms can provide healthcare services in remote areas, improve privacy, and lower the required communication bandwidth. An efficient computation engine improves the performance of these platforms when running deep neural networks (DNNs). Energy-efficient DNN accelerators prune computations by skipping sparse operands and detecting negative output features early. Unlike other common architectures such as systolic arrays, spatial DNN accelerators can in principle support such computation-pruning techniques, but they require a separate high-bandwidth data distribution fabric, such as buses or trees, to run these techniques efficiently and avoid network-on-chip (NoC) stalls. Spatial designs also suffer from divergence and unequal work distribution, so applying computation-pruning techniques to a spatial design still causes stalls inside the computation engine, even when the NoC provides high bandwidth to the processing elements (PEs). In a spatial architecture, PEs that finish their tasks early have slack time relative to the others. In this article, we propose an architecture with negligible area overhead that shares the scratchpads among the PEs in a novel way, exploiting the slack time created by computation-pruning techniques or by the NoC format. With our dataflow, a spatial engine benefits more efficiently from computation-pruning and data-reuse techniques. Compared to the reference design, the proposed method achieves a speedup of 1.24× and an energy-efficiency improvement of 1.18× per inference.
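The core idea in the description — early-finishing PEs have slack time that work sharing can absorb — can be illustrated with a minimal sketch. This is not the paper's actual dataflow or scratchpad design; the per-PE workloads and the idealized sharing model below are hypothetical, chosen only to show why the engine-level makespan drops when slack time is reused:

```python
# Illustrative sketch only: PEs in a spatial array finish at different
# times once sparsity pruning skips part of their work. Without sharing,
# the engine stalls until the slowest PE finishes; with idealized work
# sharing via shared scratchpads, slack time of early finishers absorbs
# the surplus of busy PEs.

def makespan_no_sharing(workloads):
    # Each PE drains only its own queue, so the engine waits for the
    # most heavily loaded PE.
    return max(workloads)

def makespan_with_sharing(workloads):
    # Idealized balancing: total remaining work spreads evenly across
    # all PEs, so the makespan approaches ceil(total / n).
    total, n = sum(workloads), len(workloads)
    return -(-total // n)  # ceil division

# Hypothetical MAC counts left per PE after pruning.
loads = [4, 9, 3, 8]
print(makespan_no_sharing(loads))    # 9
print(makespan_with_sharing(loads))  # 6
```

Under this toy model the stall time is the gap between the two values; the paper's contribution is a scratchpad-sharing dataflow that recovers part of that gap at negligible area cost.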
Database: OpenAIRE