REDUCT: Keep it Close, Keep it Cool!: Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute

Author: Sreenivas Subramoney, Shankar Balachandran, Rahul Bera, Joydeep Rakshit, Anant V. Nori, Avishaii Abuhatzera, Belliappa Kuttanna, Omer Om J
Year of publication: 2021
Source: ISCA
DOI: 10.1109/isca52012.2021.00022
Description: Deep Neural Networks (DNNs) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (in both the datacenter and at the edge) continues. General-purpose multi-core CPUs offer uniquely attractive advantages for DNN inference in both the datacenter [60] and at the edge [71]. Most CPU pipeline design complexity targets general-purpose single-thread performance and is overkill for the relatively simple, but still hugely important, data-parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference. We present REDUCT, which builds innovative solutions that bypass traditional CPU resources that inflate DNN inference power and limit its performance. Fundamentally, REDUCT's "Keep it close" policy enables consecutive pieces of work to be executed close to each other. REDUCT enables instruction delivery/decode close to execution and instruction execution close to data. Simple ISA extensions encode the fixed-iteration-count, loop-y workload behavior, enabling an effective bypass of many power-hungry front-end stages of the wide out-of-order (OoO) CPU pipeline. Per-core performance scales efficiently by distributing lightweight tensor compute near all caches in a multi-level cache hierarchy. This maximizes the cumulative utilization of the existing architectural bandwidth resources in the system and minimizes data movement. Across a number of DNN models, REDUCT achieves a 2.3X increase in convolution performance/Watt with a 2X to 3.94X scaling in raw performance. Similarly, REDUCT achieves a 1.8X increase in inner-product performance/Watt with a 2.8X scaling in performance. This performance/power scaling is achieved with no increase in cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development, while achieving performance similar to or better than state-of-the-art Domain-Specific Accelerators (DSAs) for DNN inference, providing fresh design choices in the AI era.
Database: OpenAIRE
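
The two mechanisms the abstract names, fixed-iteration-count loop encoding (so the OoO front end can be bypassed) and distributing independent tensor tiles near caches, can be pictured with a plain C kernel. The sketch below is illustrative only and is not REDUCT's actual ISA or microarchitecture: the function names, the `TILE` size, and the loop-descriptor idea in the comments are hypothetical stand-ins under the assumptions stated in the comments.

```c
/*
 * Illustrative sketch only (not REDUCT's actual ISA): a fixed-trip-count
 * inner-product kernel of the kind the abstract describes. Because the
 * trip count is known before the loop starts, a hypothetical loop-descriptor
 * instruction could deliver the decoded loop body to the execution units
 * once and replay it, bypassing repeated fetch/decode in the OoO front end.
 */
#include <stddef.h>

/* Hypothetical tile size, assumed chosen so each tile fits in one cache
 * slice; the paper distributes such tensor work near all cache levels. */
#define TILE 256

static float dot_tile(const float *a, const float *b)
{
    float acc = 0.0f;
    /* Fixed iteration count: no data-dependent branch inside the loop. */
    for (size_t i = 0; i < TILE; i++)
        acc += a[i] * b[i];
    return acc;
}

float inner_product(const float *a, const float *b, size_t n_tiles)
{
    float acc = 0.0f;
    /* Each tile is independent, so in principle tiles could be farmed out
     * to lightweight compute sitting next to different caches and the
     * partial sums reduced at the end. */
    for (size_t t = 0; t < n_tiles; t++)
        acc += dot_tile(a + t * TILE, b + t * TILE);
    return acc;
}
```

The point of the sketch is that nothing in such a kernel needs the wide front end's per-iteration work: the control flow is fully determined up front, which is what makes an ISA-level loop encoding and near-cache execution plausible for this workload class.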