Stride 2 1-D, 2-D, and 3-D Winograd for Convolutional Neural Networks

Author:	Seok-Bum Ko, Juan Yepez
Year of publication:	2020
Subject:	Computational complexity theory; Kernel (image processing); Hardware and Architecture; Computer science; STRIDE; Electrical and Electronic Engineering; Arithmetic; Convolutional neural network; Software
Source:	IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 28:853-863
ISSN:	1557-9999 1063-8210
DOI:	10.1109/tvlsi.2019.2961602
Description:	Convolutional neural networks (CNNs) have been widely adopted for computer vision applications. CNNs require many multiplications, making their use expensive in terms of both computational complexity and hardware. An effective way to reduce the number of required multiplications is the Winograd algorithm. Previous implementations of CNNs based on Winograd use the 2-D algorithm $F(2 \times 2, 3 \times 3)$, which reduces computational complexity by a factor of 2.25 over regular convolution. However, current Winograd implementations only apply when using a stride (shift displacement of a kernel over an input) of 1. In this article, we present a novel method to apply the Winograd algorithm to a stride of 2. This method is valid for one, two, or three dimensions. We also introduce new Winograd versions compatible with kernels of size 3, 5, and 7. The algorithms were successfully implemented on an NVIDIA K20c GPU. Compared to regular convolutions, the implementations for stride 2 are $1.44\times$ faster for a $3 \times 3$ kernel, $2.04\times$ faster for a $5 \times 5$ kernel, $2.42\times$ faster for a $7 \times 7$ kernel, and $1.73\times$ faster for a $3 \times 3 \times 3$ kernel. Additionally, a CNN accelerator was implemented on an Intel Arria-10 field-programmable gate array (FPGA) using a novel processing element (PE) that performs two 2-D Winograd stride-1 operations, or one 2-D Winograd stride-2 operation, per clock cycle. We accelerated the original and our proposed modified VGG-16 architectures and achieved digital signal processor (DSP) efficiencies of 1.22 giga operations per second (GOPS)/DSPs and 1.33 GOPS/DSPs, respectively.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::0bf0af87e0b5656f4b004e044b29ce8f https://doi.org/10.1109/tvlsi.2019.2961602
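
The multiplication savings mentioned in the abstract can be illustrated with a minimal 1-D sketch. The block below implements the standard Winograd $F(2,3)$ transform (4 multiplications instead of 6 for two outputs) and checks the 2.25 factor quoted for $F(2 \times 2, 3 \times 3)$ (36 direct multiplications versus 16). It also shows one generic way to rewrite a stride-2, 3-tap convolution as two stride-1 convolutions by splitting the input and kernel into even and odd positions. Whether this parity split matches the construction used in the article is an assumption; the record does not describe the method in detail. All function names here are hypothetical.

```python
def direct_conv1d(x, g, stride=1):
    """Direct 1-D correlation (CNN-style convolution), no padding."""
    n_out = (len(x) - len(g)) // stride + 1
    return [sum(g[k] * x[i * stride + k] for k in range(len(g)))
            for i in range(n_out)]


def winograd_f23(d, g):
    """Standard Winograd F(2,3): 2 outputs from a 4-element tile and 3 taps.

    Needs 4 data-by-kernel multiplications; the direct method needs 6.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Kernel transform (can be precomputed once per filter).
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # The four element-wise products.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def stride2_via_parity_split(x, g):
    """Stride-2, 3-tap correlation as a sum of two stride-1 correlations.

    Assumption: illustrative decomposition, not necessarily the article's.
    y[i] = g0*x[2i] + g2*x[2i+2]  (2-tap kernel on even samples)
         + g1*x[2i+1]             (1-tap kernel on odd samples)
    """
    x_even, x_odd = x[0::2], x[1::2]
    n_out = (len(x) - 3) // 2 + 1
    even_part = direct_conv1d(x_even, [g[0], g[2]])
    odd_part = direct_conv1d(x_odd, [g[1]])
    return [even_part[i] + odd_part[i] for i in range(n_out)]


# Multiplication counts for one 2x2 output tile with a 3x3 kernel.
DIRECT_MULTS_2D = 2 * 2 * 3 * 3   # 36
WINOGRAD_MULTS_2D = 4 * 4         # 16
REDUCTION_2D = DIRECT_MULTS_2D / WINOGRAD_MULTS_2D  # 2.25
```

Because the two sub-convolutions produced by the split have stride 1, each of them is in principle a candidate for a stride-1 fast algorithm, which is the general motivation for this kind of rewrite.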