O⁴-DNN: A Hybrid DSP-LUT-Based Processing Unit With Operation Packing and Out-of-Order Execution for Efficient Realization of Convolutional Neural Networks on FPGA Devices
Author: | Pouya Haghi, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram |
---|---|
Year of publication: | 2020 |
Subject: | |
Source: | IEEE Transactions on Circuits and Systems I: Regular Papers. 67:3056-3069 |
ISSN: | 1558-0806 1549-8328 |
DOI: | 10.1109/tcsi.2020.2986350 |
Description: | In this paper, we propose O⁴-DNN, a high-performance FPGA-based architecture for convolutional neural network (CNN) accelerators that relies on operation packing and out-of-order (OoO) execution for DSP blocks augmented with LUT-based glue logic. The high-level architecture comprises a systolic array of processing elements (PEs) supporting an output-stationary dataflow. In this architecture, the computational unit of each PE is realized using a DSP block and a small number of LUTs. Given the limited number of DSP blocks in FPGAs, combining each DSP block with a few LUTs extracts more computational power from each DSP block. The proposed computational unit performs eight convolutional operations on five input operands, one of which is an 8-bit weight and the other four of which are 8-bit input feature (IF) maps. In addition, to improve the energy efficiency of the proposed computational unit, we present an approximate form of the unit suitable for neural network applications. To reduce the memory bandwidth and increase the utilization of the computational units, a data reuse technique based on weight sharing is also presented. To further improve the performance of the proposed computational unit, an addressing approach for computing the partial sums out of order is proposed. The efficacy of the architecture is assessed using two FPGA devices executing four state-of-the-art neural networks. Experimental results show that this architecture achieves, on average (up to), 2.5× (3.44×) higher throughput than a baseline structure. In addition, an average (maximum) energy efficiency improvement of 12% (40%) is achievable by employing O⁴-DNN compared to the baseline structure. |
Database: | OpenAIRE |
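The abstract does not spell out how operation packing works inside the DSP block, so the following is only a minimal sketch of the general idea, under stated assumptions: two unsigned 8-bit activations share one wide multiplier with a single unsigned 8-bit weight, and the two products are recovered from disjoint bit fields of the result. The function name `packed_multiply`, the 16-bit field width, and the restriction to unsigned operands are all illustrative choices, not details taken from the paper, whose unit packs four IF operands per weight and relies on additional LUT-based glue logic.

```python
def packed_multiply(weight: int, a0: int, a1: int, shift: int = 16) -> tuple[int, int]:
    """Compute (weight*a0, weight*a1) with a single wide multiplication.

    Assumption (illustrative only): all operands are unsigned 8-bit values,
    so each product is below 2**16 and fits in its own 16-bit field.
    """
    for value in (weight, a0, a1):
        if not 0 <= value <= 0xFF:
            raise ValueError("operands must be unsigned 8-bit values")

    # Place a1 in the upper field and a0 in the lower field of one wide operand.
    packed = (a1 << shift) | a0

    # One multiplication yields weight*a1 * 2**shift + weight*a0.
    wide_product = weight * packed

    # The lower field holds weight*a0; the upper field holds weight*a1.
    mask = (1 << shift) - 1
    p0 = wide_product & mask
    p1 = wide_product >> shift
    return p0, p1


if __name__ == "__main__":
    print(packed_multiply(3, 5, 7))
```

Because the largest possible product, 255 × 255 = 65025, is below 2^16, the lower field never carries into the upper one in this unsigned sketch. Signed operands or narrower fields would need a correction step, which is one plausible role for extra logic around the multiplier.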