A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems
Authors: | Michael Taylor, Dustin Richmond, Max Ruttenberg, Peitian Pan, Zhiru Zhang, Seyed Borna Ehsani, Preslav Ivanov, Krithik Ranjan, Dai Cheol Jung, Christopher Batten, Lin Cheng, Jack Weber, Bandhav Veluri, Zhongyuan Zhao |
---|---|
Publication year: | 2022 |
Subject: | |
Source: | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 41:1620-1635 |
ISSN: | 1937-4151 0278-0070 |
Description: | Future CPU-manycore heterogeneous systems can provide high peak throughput by integrating thousands of simple, independent, energy-efficient cores on a single die. However, there are two key challenges to translating this high peak throughput into improved end-to-end workload performance: (1) manycore co-processors rely on simple hardware, putting significant demands on the software programmer; and (2) manycore co-processors use in-order cores that struggle to tolerate long memory latencies. To address the manycore programmability challenge, this paper presents a dense and sparse tensor processing framework based on PyTorch that enables domain experts to easily accelerate off-the-shelf workloads on CPU-manycore heterogeneous systems. To address the manycore memory latency challenge, we use our extended PyTorch framework to explore the potential for decoupled access/execute (DAE) software and hardware mechanisms. More specifically, we propose two software-only techniques, naive-software DAE and systolic-software DAE, along with a lightweight hardware access accelerator to further improve area-normalized throughput. We evaluate our techniques using a combination of PyTorch operator microbenchmarking and real-world PyTorch workloads running on a detailed register-transfer-level model of a 128-core manycore architecture. Our evaluation on three real-world dense and sparse tensor workloads suggests these workloads can achieve approximately 2-6× performance improvement when scaled to a future 2,000-core CPU-manycore heterogeneous system compared to an 18-core out-of-order CPU baseline, while potentially achieving higher area-normalized throughput and improved energy efficiency compared to general-purpose graphics processing units. |
Database: | OpenAIRE |
External link: |