Description: |
Stream computing is a suitable approach for improving both the performance and the power efficiency of numerical computations on FPGAs. To achieve further performance gains, temporal and spatial parallelism have been exploited: the former deepens the pipelines of streamed computation cores, and the latter duplicates them. These two types of parallelism were previously evaluated on an Arria 10 FPGA. However, it has not been verified whether they are also effective on the latest FPGA, Stratix 10, which has more logic elements (2.4X as many as Arria 10) and a new feature for raising the maximum clock frequency, the HyperFlex architecture. To show their scalability on such state-of-the-art FPGAs, in this paper we first implemented a streamed fluid simulation accelerator with both types of parallelism on Stratix 10. We then thoroughly evaluated it in terms of computational performance (FLOPS), power efficiency (FLOPS/W), resource utilization, and maximum clock frequency (Fmax). The results show that this implementation used DSP blocks excessively because floating-point operations were mapped to them inefficiently, which reduced both Fmax and the number of pipelined cores. To improve scalability, we optimized the implementation to reduce DSP block usage by using the Multiply-Add function available within a single DSP block. As a result, the optimized fluid simulation achieves 1.06 TFLOPS and 12.6 GFLOPS/W, which are 1.36X and 1.24X those of the non-optimized version, respectively. Moreover, we estimate that the fluid simulation on Stratix 10 could outperform a GPU-based implementation on a Tesla V100 if it were further optimized for the HyperFlex architecture. |
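The link between DSP usage, the two kinds of parallelism, and peak throughput can be illustrated with a back-of-the-envelope model. This is a minimal sketch, not the authors' design flow, and it rests on hypothetical assumptions: every floating-point multiply and add otherwise occupies its own DSP block, fusing a multiply that feeds an add into one Multiply-Add block saves one DSP block per pair, and peak throughput scales with the total number of pipeline stages (spatial copies times temporal depth). The function names (`dsp_per_stage`, `max_stages`, `peak_gflops`) and all numbers are illustrative, not values from the paper.

```python
"""Back-of-the-envelope model of DSP usage and peak throughput for a
streamed FPGA pipeline. All parameters are hypothetical placeholders."""


def dsp_per_stage(n_mul: int, n_add: int, n_fused_pairs: int, fuse: bool) -> int:
    """DSP blocks used by one pipeline stage.

    Hypothetical assumption: each FP multiply and each FP add occupies one
    DSP block, and fusing a multiply that feeds an add into a single
    Multiply-Add block saves one DSP block per fused pair.
    """
    if not 0 <= n_fused_pairs <= min(n_mul, n_add):
        raise ValueError("fused pairs must be between 0 and min(n_mul, n_add)")
    total = n_mul + n_add
    return total - n_fused_pairs if fuse else total


def max_stages(dsp_budget: int, dsp_per_stage_count: int) -> int:
    """Largest number of stages (spatial copies x temporal depth) that fit."""
    return dsp_budget // dsp_per_stage_count


def peak_gflops(spatial: int, temporal: int, flops_per_stage: int, fmax_mhz: float) -> float:
    """Peak GFLOPS if every stage completes flops_per_stage operations per cycle.

    MHz x FLOP per cycle gives MFLOPS; dividing by 1e3 converts to GFLOPS.
    """
    return spatial * temporal * flops_per_stage * fmax_mhz / 1e3


if __name__ == "__main__":
    # Hypothetical stage: 10 multiplies, 12 adds, 8 multiply->add pairs.
    n_mul, n_add, pairs = 10, 12, 8
    budget = 1000  # hypothetical DSP budget left for the cores
    for fuse in (False, True):
        d = dsp_per_stage(n_mul, n_add, pairs, fuse)
        print(f"fuse={fuse}: {d} DSPs/stage, up to {max_stages(budget, d)} stages")
    # 2 spatial copies x 4 temporal stages, 22 FLOP per stage, 250 MHz
    print(peak_gflops(2, 4, n_mul + n_add, 250.0))
```

With these made-up numbers, fusing cuts the per-stage cost from 22 to 14 DSP blocks, so 71 stages fit in the budget instead of 45. The model deliberately ignores Fmax, routing, and memory bandwidth, which the evaluation above shows also limit how far the pipelines can be deepened or duplicated.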