Description: |
Interactive intelligent services (e.g., smart web search) are becoming essential datacenter workloads. They rely on data-intensive artificial intelligence (AI) algorithms that cannot batch requests due to their tight latency constraints. Since off-chip data accesses incur higher latency and energy consumption than on-chip accesses, a persistent AI approach, in which the entire model is stored in on-chip memory, is becoming the new norm for real-time AI. This approach is the cornerstone of Microsoft's Brainwave FPGA-based AI cloud and was recently added to Nvidia's cuDNN library. In this work, we implement, optimize, and evaluate a Brainwave-like neural processing unit (NPU) on a large Stratix-10 FPGA and benchmark it against a large Nvidia Volta GPU running cuDNN persistent AI kernels. Across real-time persistent RNN, GRU, and LSTM workloads, we show that Stratix-10 offers ~3× (FP32) and ~10× (INT8) lower latency than the GPU (FP32), which achieves only ~6% of its peak throughput. We then propose TensorRAM, an ASIC chiplet for persistent AI that is 2.5D-integrated with an FPGA in the same package. TensorRAM enhances on-chip memory capacity and bandwidth, and provides enough multi-precision INT8/4/2/1 compute throughput to match that bandwidth; multiple TensorRAM chiplets can be integrated with a single Stratix-10 FPGA. Our evaluation shows that a small 32 mm² TensorRAM in 10 nm technology offers 64 MB of SRAM with 32 TB/s on-chiplet bandwidth and 64 TOP/s (INT8). A small Stratix-10 with a TensorRAM (INT8) offers 16× lower latency and 34× better energy efficiency than the GPU (FP32). Overall, Stratix-10 with TensorRAM offers a compelling and scalable persistent AI solution.
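  A back-of-envelope check of the quoted chiplet figures (our reading, not stated in the abstract itself): if every INT8 byte streamed from the on-chiplet SRAM feeds one multiply-accumulate, i.e., two operations, the quoted bandwidth and compute throughput match exactly:

  % Assumption: 1 INT8 MAC (2 ops) consumed per byte of streamed on-chiplet data
  \[
    \underbrace{32~\text{TB/s}}_{\text{SRAM bandwidth}} \times \underbrace{2~\text{ops/byte}}_{\text{1 INT8 MAC per byte}} = 64~\text{TOP/s (INT8)}
  \]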