Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays
Author: Xin Dong, Chih Chiang Chen, Hsiang-Tsung Kung, Sai Qian Zhang, Bradley McDanel
Year: 2019
Subjects: Artificial neural network; Deep learning; Inference; Parallel computing; Convolutional neural network; Artificial intelligence; Computer architecture; Computer science
Source: ASAP
Description: We present the Maestro memory-on-logic 3D-IC architecture for coordinated parallel use of a plurality of systolic arrays (SAs) in performing deep neural network (DNN) inference. Maestro reduces the under-utilization common for a single large SA by allowing parallel use of many smaller SAs on DNN weight matrices of varying shapes and sizes. To buffer intermediate results in memory blocks (MBs) and provide coordinated high-bandwidth communication between SAs and MBs when transferring weights and results, Maestro employs three innovations: (1) an SA on the logic die can access its corresponding MB on the memory die over a short distance using 3D-IC interconnects; (2) through an efficient switch based on H-trees, an SA can access any MB with low latency; and (3) the switch can combine partial results from SAs elementwise before writing them back to a destination MB. We describe the Maestro architecture, including a circuit and layout design, detail the scheduling of the switch, analyze system performance for real-time inference applications with a batch size of one, and showcase applications for deep learning inference, with ShiftNet for computer vision and recent Transformer models for natural language processing. For the same total number of systolic cells, Maestro, with multiple smaller SAs, leads to 16x and 12x latency improvements over a single large SA on ShiftNet and Transformer, respectively. Compared to a floating-point GPU implementation of ShiftNet and Transformer, a baseline Maestro system with 4,096 SAs (each with 8x8 systolic cells) provides significant latency improvements of 30x and 47x, respectively.
Database: OpenAIRE
External link:
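The description above explains how Maestro partitions work across many small SAs and has its switch combine partial results elementwise before write-back to a destination MB. The following is a minimal Python sketch, not from the paper, that illustrates this partial-sum-combining idea in software: a weight matrix is split into 8x8 tiles (matching the baseline SA size in the abstract), each tile stands in for one SA, and the per-tile partial results are reduced elementwise. The function names and tiling scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not from the paper): several small systolic arrays (SAs)
# jointly compute y = W @ x, each handling an 8x8 tile of W, with partial
# results combined elementwise before write-back to a destination memory
# block (MB) -- a software analogue of Maestro's switch-combine step.
import numpy as np

SA_DIM = 8  # each SA has 8x8 systolic cells (baseline configuration in the abstract)

def sa_matvec_tile(w_tile, x_tile):
    """Stand-in for one SA computing a partial matrix-vector product on its tile."""
    return w_tile @ x_tile

def tiled_matvec(W, x, sa_dim=SA_DIM):
    """Partition W into sa_dim x sa_dim tiles, assign each tile to a (virtual) SA,
    and combine the partial results elementwise along the reduction dimension."""
    rows, cols = W.shape
    y = np.zeros(rows)
    for r in range(0, rows, sa_dim):
        # Partial sums from all SAs covering this row block are combined
        # elementwise before being written back (here: into the slice of y).
        partials = [
            sa_matvec_tile(W[r:r + sa_dim, c:c + sa_dim], x[c:c + sa_dim])
            for c in range(0, cols, sa_dim)
        ]
        y[r:r + sa_dim] = np.sum(partials, axis=0)  # elementwise combine
    return y

# Usage: a 32x32 weight matrix maps onto 16 virtual 8x8 SAs working in parallel.
W = np.random.randn(32, 32)
x = np.random.randn(32)
assert np.allclose(tiled_matvec(W, x), W @ x)
```

In the real architecture this reduction happens in the H-tree switch hardware on the way to the destination MB rather than in software; the sketch only shows why splitting a weight matrix across many small SAs still yields the same result as one large SA.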