Description: |
This paper presents an analysis and optimization of the High Performance Conjugate Gradient (HPCG) benchmark on the Sunway many-core processor. On modern multi-core and many-core processors, HPCG typically performs poorly and under-utilizes computational resources because of its low arithmetic intensity and fine-grained parallelism. We apply two conventional methods, Level-Scheduling (LS) and Multi-Coloring (MC), to parallelize the Gauss-Seidel smoother, the most time-consuming kernel in HPCG. These strategies are effective and achieve 1.54x and 5.52x performance improvements, respectively. To overcome the poor locality of MC and the limited parallelism of LS, we propose a novel Hierarchical Grid (HG) algorithm. Together, our algorithmic and architecture-aware optimizations achieve an aggregate performance of 3.54 GFlops on a single core group (CG) of the SW26010 processor, which is around 0.475% of peak performance and 15.4x higher than the reference implementation. For the MPI-parallel version, we balance parallelism, pre-processing, convergence rate, and communication overhead, and achieve 192 TFlops (70% parallel efficiency) when scaling to 81920 CGs (5,324,800 cores) on the Sunway TaihuLight system. Moreover, we analyze the adaptability of our parallel method and optimization strategies, and summarize several key points for refactoring and optimizing HPC applications on the Sunway heterogeneous many-core processor. |