CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

Authors:	Twinkle Jain, Gene Cooperman
Language:	English
Year of publication:	2020
Subject:	FOS: Computer and information sciences; Computer science; 05 social sciences; 050301 education; Fault tolerance; Parallel computing; D.4.5; Software_PROGRAMMINGTECHNIQUES; Supercomputer; CUDA; Memory management; Computer Science - Distributed, Parallel, and Cluster Computing; Scalability; Virtual memory; Overhead (computing); 0501 psychology and cognitive sciences; Distributed, Parallel, and Cluster Computing (cs.DC); 0503 education; Host (network); 050104 developmental & child psychology; ComputingMethodologies_COMPUTERGRAPHICS
Source:	SC
Description:	The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer's burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single process's memory. This eliminates the high overhead of inter-process communication in earlier approaches, and has fewer limitations. (24 pages, 6 figures, 3 tables; to appear in SC'20: The International Conference for High Performance Computing, Networking, Storage, and Analysis.)
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::7259050adfa78182a4d6477037b74082 http://arxiv.org/abs/2008.10596