A Configurable Cloud-Scale DNN Processor for Real-Time AI
Authors: | Fowers, Jeremy; Ovtcharov, Kalin; Papamichael, Michael; Massengill, Todd; Liu, Ming; Lo, Daniel; Alkalay, Shlomi; Haselman, Michael; Adams, Logan; Ghandi, Mahdi; Heil, Stephen; Patel, Prerak; Sapek, Adam; Weisz, Gabriel; Woods, Lisa; Lanka, Sitaram; Reinhardt, Steven K.; Caulfield, Adrian M.; Chung, Eric S.; Burger, Doug |
Year: | 2018 |
Subject: | Artificial neural network; Cloud computing; Computer hardware & architecture; Microarchitecture; Instruction set; Computer architecture; Stratix; SIMD; Field-programmable gate array; Compile time |
Source: | ISCA |
DOI: | 10.1109/isca.2018.00012 |
Description: | Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models—aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs. |
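The abstract's claim that a single instruction can dispatch millions of operations can be illustrated with a small sketch. This is not the actual Brainwave ISA or microarchitecture — the tile counts, sizes, and function names below are hypothetical — but it shows the general idea: one matrix-vector-multiply instruction is decoded hierarchically, split row-wise across independent tile engines (each with its own local slice of the matrix), and every tile performs many multiply-accumulates, so the scalar MAC count per instruction scales with the matrix size.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
TILE_ROWS = 400  # matrix rows assigned to each tile engine (assumed)

def mvm_instruction(matrix, vector):
    """Model one SIMD-style instruction computing y = matrix @ vector.

    A top-level scheduler splits the matrix row-wise across tiles;
    each tile computes dot products over its local slice. We also
    count the scalar multiply-accumulate (MAC) operations that this
    single instruction fans out into.
    """
    rows, cols = matrix.shape
    partials = []
    macs = 0
    for start in range(0, rows, TILE_ROWS):
        tile = matrix[start:start + TILE_ROWS]  # slice held in tile-local memory
        partials.append(tile @ vector)          # tile-level dot products
        macs += tile.shape[0] * cols            # one MAC per matrix element
    return np.concatenate(partials), macs

rng = np.random.default_rng(0)
W = rng.standard_normal((2400, 2400)).astype(np.float32)
x = rng.standard_normal(2400).astype(np.float32)
y, macs = mvm_instruction(W, x)
print(macs)  # 2400 * 2400 = 5,760,000 MACs from one instruction
```

At RNN-scale dimensions, a single such instruction covers millions of MACs, which is why a lean single-threaded instruction stream can keep tens of thousands of multiply-accumulate units busy.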
Database: | OpenAIRE |
External link: |