EMOGI

Author:	Zaid Qureshi, Vikram Sharma Mailthody, Wen-mei W. Hwu, Jinjun Xiong, Eiman Ebrahimi, Seungwon Min
Year of publication:	2020
Subject:	FOS: Computer and information sciences; 010302 applied physics; Speedup; Computer science; General Engineering; Databases (cs.DB); 020207 software engineering; 02 engineering and technology; Parallel computing; 01 natural sciences; Out of memory; Data access; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Databases; 0103 physical sciences; Graph traversal; 0202 electrical engineering, electronic engineering, information engineering; Bandwidth (computing); Distributed, Parallel, and Cluster Computing (cs.DC); Massively parallel; Auxiliary memory; PCI Express
Source:	Proceedings of the VLDB Endowment. 14:114-127
ISSN:	2150-8097
Description:	Modern analytics and recommendation systems are increasingly based on graph data that capture the relations between entities being analyzed. Practical graphs come in huge sizes, offer massive parallelism, and are stored in sparse-matrix formats such as compressed sparse row (CSR). To exploit the massive parallelism, developers are increasingly interested in using GPUs for graph traversal. However, due to their sizes, graphs often do not fit into the GPU memory. Prior works have either used input data pre-processing/partitioning or unified virtual memory (UVM) to migrate chunks of data from the host memory to the GPU memory. However, the large, multi-dimensional, and sparse nature of graph data presents a major challenge to these schemes and results in significant amplification of data movement and reduced effective data throughput. In this work, we propose EMOGI, an alternative approach to traverse graphs that do not fit in GPU memory using direct cache-line-sized access to data stored in host memory. This paper addresses the open question of whether a sufficiently large number of overlapping cache-line-sized accesses can be sustained to 1) tolerate the long latency to host memory, 2) fully utilize the available bandwidth, and 3) achieve favorable execution performance. We analyze the data access patterns of several graph traversal applications on GPUs over PCIe using an FPGA to understand the cause of poor external bandwidth utilization. By carefully coalescing and aligning external memory requests, we show that we can minimize the number of PCIe transactions and nearly fully utilize the PCIe bandwidth with direct cache-line accesses to the host memory. EMOGI achieves 2.60X speedup on average compared to the optimized UVM implementations in various graph traversal applications. We also show that EMOGI scales better than a UVM-based solution when the system uses higher bandwidth interconnects such as PCIe 4.0.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3f2da832b6ef0cd6a1730d2e63a31189 https://doi.org/10.14778/3425879.3425883
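
The description's point about aligning external memory requests can be illustrated with a small counting model. This is only a sketch under stated assumptions, not the paper's implementation: the 128-byte request size, the 8-byte edge entries, the 32-thread warp, and all function names are hypothetical choices made for illustration. The model counts how many cache-line-sized requests are issued when a warp walks one CSR neighbor list starting at an arbitrary offset, versus starting at the preceding cache-line boundary with the out-of-range lanes masked off.

```python
CACHE_LINE_BYTES = 128  # assumed request granularity (not stated in the description)
ELEM_BYTES = 8          # assumed size of one edge-list entry
WARP_SIZE = 32          # assumed number of threads that issue loads together


def lines_touched(first_elem, count):
    """Number of distinct cache lines covered by `count` consecutive elements."""
    if count <= 0:
        return 0
    first_line = (first_elem * ELEM_BYTES) // CACHE_LINE_BYTES
    last_line = ((first_elem + count) * ELEM_BYTES - 1) // CACHE_LINE_BYTES
    return last_line - first_line + 1


def requests_unaligned(start, end):
    """Warp strides from `start`; each step issues one request per line touched."""
    total = 0
    for base in range(start, end, WARP_SIZE):
        total += lines_touched(base, min(WARP_SIZE, end - base))
    return total


def requests_aligned(start, end):
    """Warp strides from the cache-line boundary at or below `start`.

    Lanes that fall before `start` or at/after `end` are treated as masked off,
    so every step covers whole cache lines and no line is requested twice.
    """
    elems_per_line = CACHE_LINE_BYTES // ELEM_BYTES
    aligned_start = (start // elems_per_line) * elems_per_line
    total = 0
    for base in range(aligned_start, end, WARP_SIZE):
        lo = max(base, start)
        hi = min(base + WARP_SIZE, end)
        total += lines_touched(lo, hi - lo)
    return total


if __name__ == "__main__":
    # Neighbor list occupying edge-array slots [5, 133): 128 entries.
    print("distinct lines:", lines_touched(5, 128))
    print("unaligned requests:", requests_unaligned(5, 133))
    print("aligned requests:", requests_aligned(5, 133))
```

Under these assumptions the misaligned walk re-requests the line that straddles each warp step, while the aligned walk issues exactly one request per distinct line, which is the kind of transaction reduction the description attributes to coalescing and alignment.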