Evaluating and Enhancing Intel® Stratix® 10 FPGAs for Persistent Real-Time AI

Author:	Eriko Nurvitadhi, Raghavan Kumar, Martin Langhammer, Ali Jafari, Gregory K. Chen, Jaewoong Sim, Phillip Tomson, Sergey Gribok, Debbie Marr, Ram Krishnamurthy, Aravind Dasu, Phil Knag, Andrew Boutros, Bogdan Pasca, Dongup Kwon, Huseyin Ekin Sumbul
Year of publication:	2019
Subject:	Application-specific integrated circuit; Computer science; Deep learning; Embedded system; Stratix; Scalability; Cloud computing; Artificial intelligence; Latency (engineering); Field-programmable gate array; Efficient energy use
Source:	FPGA
DOI:	10.1145/3289602.3293943
Description:	Interactive intelligent services (e.g., smart web search) are becoming essential datacenter workloads. They rely on data-intensive artificial intelligence (AI) algorithms that do not use batch computation because of their tight latency constraints. Since off-chip data accesses have higher latency and energy consumption than on-chip accesses, a persistent AI approach, in which the entire model is stored in on-chip memory, is becoming the new norm for real-time AI. This approach is the cornerstone of Microsoft's Brainwave FPGA-based AI cloud and was recently added to Nvidia's cuDNN library. In this work, we implement, optimize, and evaluate a Brainwave-like neural processing unit (NPU) on a large Stratix 10 FPGA. We benchmark it against a large Nvidia Volta GPU running cuDNN persistent AI kernels. Across real-time persistent RNN, GRU, and LSTM workloads, we show that Stratix 10 offers ~3× (FP32) and ~10× (INT8) better latency than the GPU (FP32), which reaches only ~6% of its peak throughput. We then propose TensorRAM, an ASIC chiplet for persistent AI that is 2.5D-integrated with an FPGA in the same package. TensorRAM increases the on-chip memory capacity and bandwidth, and provides enough multi-precision INT8/4/2/1 throughput to match that bandwidth. Multiple TensorRAMs can be integrated with a Stratix 10. Our evaluation shows that a small 32 mm² TensorRAM in a 10 nm process offers 64 MB of SRAM with 32 TB/s of on-chiplet bandwidth and 64 TOP/s (INT8). A small Stratix 10 with a TensorRAM (INT8) offers 16× better latency and 34× better energy efficiency than the GPU (FP32). Overall, Stratix 10 with TensorRAM offers compelling and scalable persistent AI solutions.
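Note:	The statement that TensorRAM's compute throughput "matches" its memory bandwidth can be checked with a short back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes that each weight is read from SRAM once per inference (as in a batch-1 matrix-vector product), that one multiply-accumulate counts as 2 operations, and that MB and TB are decimal units. The function name `matched_ops_per_second` is hypothetical.

```python
def matched_ops_per_second(bandwidth_bytes_per_s: float, bits_per_weight: int) -> float:
    """Throughput needed to consume every weight streamed from SRAM exactly once.

    Assumption (not from the abstract): one multiply-accumulate per weight,
    counted as 2 operations.
    """
    weights_per_byte = 8 / bits_per_weight
    ops_per_weight = 2  # one multiply + one add
    return bandwidth_bytes_per_s * weights_per_byte * ops_per_weight


BANDWIDTH = 32e12  # 32 TB/s on-chiplet bandwidth (figure from the abstract)
CAPACITY = 64e6    # 64 MB of SRAM (figure from the abstract, decimal MB assumed)

int8_tops = matched_ops_per_second(BANDWIDTH, bits_per_weight=8) / 1e12
sweep_time_us = CAPACITY / BANDWIDTH * 1e6  # time to stream the full SRAM once

print(f"INT8 throughput matching the bandwidth: {int8_tops:.0f} TOP/s")
print(f"One full pass over 64 MB of weights: {sweep_time_us:.1f} us")
```

Under these assumptions, the bandwidth-matched INT8 figure comes out to 64 TOP/s, which agrees with the value quoted in the abstract, and a full pass over the on-chiplet weights takes about 2 microseconds.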
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::ec019e5424875aa387716e89b6152bec https://doi.org/10.1145/3289602.3293943