Physically Tightly Coupled, Logically Loosely Coupled, Near-Memory BNN Accelerator (PTLL-BNN)

Authors:	Wen-Chien Ting, Tai-Hsing Wen, Jun-Shen Wu, Yun-Chen Lo, Ren-Shuo Liu, Jian-Hao Huang, Yun-Sheng Chang, Yu-Chun Kuo
Year of publication:	2019
Subject:	Applied physics; Computer science; Electrical & electronic engineering; Engineering and technology; Chip; Natural sciences; Microarchitecture; Power (physics); Logic gate; Physical sciences; Electrical engineering, electronic engineering, information engineering; Fuse (electrical); Static random-access memory; Compiler; Field-programmable gate array; Computer hardware
Source:	ESSCIRC
DOI:	10.1109/esscirc.2019.8902909
Description:	In this paper, a physically tightly coupled, logically loosely coupled, near-memory binary neural network accelerator (PTLL-BNN) is designed and fabricated. Both architecture-level and circuit-level optimizations are presented. From the perspective of processor architecture, the PTLL-BNN includes two new design choices. First, the proposed BNN accelerator is placed close to the SRAM of the embedded processors (i.e., physically tightly coupled and near-memory); thus, the extra SRAM cost incurred by the accelerator is as low as 0.5 KB. Second, the accelerator is a memory-mapped I/O (MMIO) device (i.e., logically loosely coupled), so all embedded processors can be equipped with the proposed accelerator without the burden of changing their compilers and pipelines. From the circuit perspective, this work employs four techniques to optimize the power and cost of the accelerator. First, this design adopts a unified input-kernel-output memory instead of the separate memories that many previous works adopt. Second, the data layout that this work chooses increases the sequentiality of the SRAM accesses and reduces the buffer size needed to store the intermediate values. Third, this work proposes fusing the max-pooling, batch-normalization, and binarization layers of the BNNs to significantly reduce the hardware complexity. Finally, a novel methodology for generating the scheduler hardware of the accelerator is included. We fabricate the accelerator using the TSMC 180 nm technology. The chip measurement results reach 91 GOP/s on average (307 GOP/s at peak) at 200 MHz. The achieved GOP/s per million logic gates and GOP/s per KB SRAM are 2.6 to 237 times greater than those of previous works, respectively. We also realize an FPGA system to demonstrate the recognition of CIFAR-10/100 images using the fabricated accelerator.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::90128cd08e9540b66706c36768bbb349 https://doi.org/10.1109/esscirc.2019.8902909
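
Illustrative note on the layer fusion mentioned in the description: this record does not say how the paper fuses max-pooling, batch-normalization, and binarization, so the following is only a minimal sketch of one commonly used approach, threshold folding, and may differ from the authors' circuit. It assumes the layer order max-pool, then batch-norm, then binarization with a "greater than or equal to zero maps to 1" rule. All function and parameter names (`reference_bit`, `fold_threshold`, `fused_bit`, `gamma`, `beta`, `mean`, `var`, `eps`) are hypothetical.

```python
import math


def reference_bit(window, gamma, beta, mean, var, eps=0.0):
    """Unfused path: max-pool the window, apply batch-norm, then binarize."""
    pooled = max(window)
    y = gamma * (pooled - mean) / math.sqrt(var + eps) + beta
    return 1 if y >= 0 else 0


def fold_threshold(gamma, beta, mean, var, eps=0.0):
    """Fold the batch-norm parameters into one threshold (needs gamma != 0)."""
    return mean - beta * math.sqrt(var + eps) / gamma


def fused_bit(window, gamma, beta, mean, var, eps=0.0):
    """Fused path: one comparison per value, reduced with OR or AND.

    Batch-norm is a monotonic affine map, so its sign depends only on how the
    pooled value compares with a precomputed threshold:
      gamma > 0: max(window) >= tau  <=>  any(x >= tau)
      gamma < 0: max(window) <= tau  <=>  all(x <= tau)
    Only 1-bit comparison results need to be kept while the window streams in.
    """
    if gamma == 0:
        # Output no longer depends on the input, only on the sign of beta.
        return 1 if beta >= 0 else 0
    tau = fold_threshold(gamma, beta, mean, var, eps)
    if gamma > 0:
        return int(any(x >= tau for x in window))
    return int(all(x <= tau for x in window))


if __name__ == "__main__":
    # Assumed example parameters, chosen so the arithmetic is exact.
    params = dict(gamma=0.5, beta=-1.0, mean=3.0, var=4.0)
    window = [2, 7, 5, 1]  # e.g. accumulated values in a 2x2 pooling window
    print(reference_bit(window, **params), fused_bit(window, **params))
```

In this sketch the threshold is computed once per channel offline, so the per-value work at run time reduces to a comparison and a 1-bit reduction. Near the threshold, floating-point rounding can make the fused and unfused paths disagree, so a real design would fix the threshold in the same integer domain as the accumulated values.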