Implementing directed acyclic graphs with the heterogeneous system architecture
Authors: | Wei Wu, Sooraj Puthoor, Bradford M. Beckmann, Shuai Che, Mayank Daga, Gregory Rodgers, Ashwin M. Aji |
---|---|
Year of publication: | 2016 |
Subject: |
Heterogeneous System Architecture
Speedup; Task management; Computer science; Distributed computing; 0206 medical engineering; Symmetric multiprocessor system; 02 engineering and technology; Parallel computing; Directed acyclic graph; Task (computing); Shared memory; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Programming paradigm; 020602 bioinformatics |
Source: | GPGPU@PPoPP |
Description: | Achieving optimal performance on heterogeneous computing systems requires a programming model that supports the execution of asynchronous, multi-stream, and out-of-order tasks in a shared memory environment. Asynchronous dependency-driven tasking is one such programming model that allows the computation to be expressed as a directed acyclic graph (DAG) and exposes fine-grain task management to the programmer. The use of DAGs to extract parallelism also enables runtimes to perform dynamic load-balancing, thereby achieving higher throughput when compared to the traditional bulk-synchronous execution. However, efficient DAG implementations require features such as user-level task dispatch, hardware signalling, and local barriers to achieve low-overhead task dispatch and dependency resolution. In this paper, we demonstrate that the Heterogeneous System Architecture (HSA) exposes the above capabilities, and we validate their benefits by implementing three well-referenced applications using fine-grain tasks: Cholesky factorization, Lower Upper Decomposition (LUD), and Needleman-Wunsch (NW). HSA's user-level task dispatch and signalling capabilities allow work to be launched and dependencies to be managed directly by the hardware, avoiding inefficient bulk-synchronization. Our results show that the HSA task-based implementations of Cholesky, LUD, and NW are representative of this emerging class of workloads and, using hardware-managed tasks, achieve speedups of 3.8x, 1.6x, and 1.5x, respectively, compared to bulk-synchronous implementations. |
Database: | OpenAIRE |