A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems

Authors:	Michael Taylor, Dustin Richmond, Max Ruttenberg, Peitian Pan, Zhiru Zhang, Seyed Borna Ehsani, Preslav Ivanov, Krithik Ranjan, Dai Cheol Jung, Christopher Batten, Lin Cheng, Jack Weber, Bandhav Veluri, Zhongyuan Zhao
Year of publication:	2022
Subject:	Computer science; Computer Graphics and Computer-Aided Design; CAS latency; Domain (software engineering); Software; Embedded system; Key (cryptography); Electrical and Electronic Engineering; Graphics; Performance improvement; Programmer; Throughput (business)
Source:	IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 41:1620-1635
ISSN:	1937-4151 0278-0070
Description:	Future CPU-manycore heterogeneous systems can provide high peak throughput by integrating thousands of simple, independent, energy-efficient cores in a single die. However, there are two key challenges to translating this high peak throughput into improved end-to-end workload performance: (1) manycore co-processors rely on simple hardware, putting significant demands on the software programmer; and (2) manycore co-processors use in-order cores that struggle to tolerate long memory latencies. To address the manycore programmability challenge, this paper presents a dense and sparse tensor processing framework based on PyTorch that enables domain experts to easily accelerate off-the-shelf workloads on CPU-manycore heterogeneous systems. To address the manycore memory latency challenge, we use our extended PyTorch framework to explore the potential for decoupled access/execute (DAE) software and hardware mechanisms. More specifically, we propose two software-only techniques, naive-software DAE and systolic-software DAE, along with a lightweight hardware access accelerator to further improve area-normalized throughput. We evaluate our techniques using a combination of PyTorch operator microbenchmarking and real-world PyTorch workloads running on a detailed register-transfer-level model of a 128-core manycore architecture. Our evaluation on three real-world dense and sparse tensor workloads suggests these workloads can achieve approximately 2-6× performance improvement when scaled to a future 2,000-core CPU-manycore heterogeneous system compared to an 18-core out-of-order CPU baseline, while potentially achieving higher area-normalized throughput and improved energy efficiency compared to general-purpose graphics processing units.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::4065043b998b1326c9077db3d59bc0b2 https://doi.org/10.1109/tcad.2021.3103825