Author:
Liu, Zheng; Hao, Meng; Zhang, Weizhe; Lu, Gangzhao; Tian, Xueyang; Yang, Siyu; Xie, Mingdong; Dai, Jie; Yuan, Chenyu; Wang, Desheng; Yang, Hongwei
Source:
CCF Transactions on High Performance Computing; 2024-01-01, Issue: Preprints, p. 1-19, 19 p.
Abstract:
The integration of Large Language Models (LLMs) with Convolutional Neural Networks (CNNs) is significantly advancing the development of large models. However, the computational cost of large models is high, necessitating optimization for greater efficiency. One effective way to optimize CNNs is depthwise separable convolution (DSC), which decouples spatial and channel convolutions to reduce the number of parameters and enhance efficiency. In this study, we focus on porting and optimizing DSC kernel functions from the GPU to the Deep Computing Unit (DCU), a computing accelerator developed in China. For depthwise convolution, we implement a row data reuse algorithm to minimize redundant data loading and memory access overhead. For pointwise convolution, we extend our dynamic tiling strategy to improve hardware utilization by balancing resource allocation among blocks and threads, and we enhance arithmetic intensity through a channel distribution algorithm. We implement depthwise and pointwise convolution kernel functions and integrate them into PyTorch as extension modules. Experiments demonstrate that our optimized kernel functions outperform the MIOpen library on the DCU, achieving up to a 3.59× speedup in depthwise convolution and up to a 3.54× speedup in pointwise convolution. These results highlight the effectiveness of our approach in leveraging the DCU's architecture to accelerate deep learning operations.
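The parameter reduction the abstract attributes to DSC can be illustrated with a simple count: a standard convolution with a k×k kernel mapping C_in to C_out channels uses k·k·C_in·C_out weights, whereas DSC splits this into a depthwise stage (k·k·C_in) plus a pointwise stage (C_in·C_out). The sketch below is illustrative only; the channel and kernel sizes are arbitrary assumptions, not values from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dsc_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in depthwise separable convolution:
    depthwise (k*k per input channel) + pointwise (1x1, c_in -> c_out)."""
    return k * k * c_in + c_in * c_out

# Example sizes (assumed for illustration): 64 -> 128 channels, 3x3 kernel.
std = standard_conv_params(64, 128, 3)   # 3*3*64*128 = 73728
dsc = dsc_params(64, 128, 3)             # 3*3*64 + 64*128 = 8768
print(std, dsc, round(std / dsc, 2))     # roughly an 8.4x parameter reduction
```

The two stages also map onto the paper's optimization targets: the depthwise stage is memory-bound (hence row data reuse), while the 1×1 pointwise stage dominates arithmetic work (hence tiling and channel distribution).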
Database:
Supplemental Index
External link: