Efficient User-Level Storage Disaggregation for Deep Learning
Author: | Kathryn Mohror, Yue Zhu, Adam Moody, Bing Jiao, Fahim Chowdhury, Weikuan Yu |
---|---|
Year of publication: | 2019 |
Subject: | File system, Computer science, ext4, Deep learning, NVM Express, Distributed computing, CPU time, Networking & telecommunications, Directory, Supercomputer, Non-volatile memory, Artificial intelligence, Throughput |
Source: | CLUSTER |
Description: | On large-scale high performance computing (HPC) systems, applications are provisioned with aggregated resources to meet their peak demands for brief periods. This leaves resources underutilized, because application requirements vary widely during execution. The problem is particularly pronounced for deep learning applications running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need to load many small samples randomly for successful training. We have designed a specialized Deep Learning File System (DLFS) that provides a thin set of APIs. In particular, we design the metadata management of DLFS around an in-memory, tree-based sample directory, and its file services around the user-level SPDK framework, which disaggregates the capabilities of NVM Express (NVMe) devices to parallel training tasks (a rough sketch of these two ideas follows this record). Our experimental results show that DLFS dramatically improves training throughput for deep neural networks on NVMe over Fabrics, compared with the kernel-based ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization. |
Database: | OpenAIRE |
External link: |
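The record above only summarizes the design; the paper's implementation is not reproduced here. As a rough illustration of the two ideas named in the description, an in-memory tree-based sample directory and the random loading of many small samples, the following is a minimal Python sketch. It assumes samples are tracked as (LBA, length) extents on an NVMe namespace; every identifier (`SampleDirectory`, `Extent`, `random_batch`, and so on) is hypothetical and not DLFS's actual API.

```python
# Minimal sketch of an in-memory, tree-based sample directory in the
# spirit of DLFS's metadata design. All names here are hypothetical
# illustrations, not DLFS's real interfaces.

import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Extent:
    """Assumed on-device location of one sample: start block + length."""
    lba: int       # starting logical block address on the NVMe namespace
    nblocks: int   # length in blocks


@dataclass
class DirNode:
    """One tree level, e.g. dataset/class/sample in an ImageNet-style layout."""
    children: Dict[str, "DirNode"] = field(default_factory=dict)
    samples: Dict[str, Extent] = field(default_factory=dict)


class SampleDirectory:
    """In-memory tree mapping slash-separated sample paths to NVMe extents."""

    def __init__(self) -> None:
        self.root = DirNode()
        self._flat: List[Tuple[str, Extent]] = []  # for O(1) random picks

    def insert(self, path: str, extent: Extent) -> None:
        *dirs, name = path.strip("/").split("/")
        node = self.root
        for part in dirs:
            node = node.children.setdefault(part, DirNode())
        node.samples[name] = extent
        self._flat.append((path, extent))

    def lookup(self, path: str) -> Extent:
        *dirs, name = path.strip("/").split("/")
        node = self.root
        for part in dirs:
            node = node.children[part]
        return node.samples[name]

    def random_batch(self, batch_size: int) -> List[Tuple[str, Extent]]:
        """Pick a random mini-batch of small samples: the access pattern
        the paper identifies as critical for DNN training."""
        return random.sample(self._flat, batch_size)


if __name__ == "__main__":
    sdir = SampleDirectory()
    for i in range(10_000):
        # Assume each 64 KiB sample occupies 16 blocks of 4 KiB.
        sdir.insert(f"imagenet/train/class{i % 100}/img{i}.jpg",
                    Extent(lba=i * 16, nblocks=16))
    for path, ext in sdir.random_batch(4):
        print(path, "->", ext)
```

In DLFS itself, the extents returned by such lookups would be handed to SPDK's user-level NVMe driver, so each parallel training task reads its samples directly from a local or NVMe-over-Fabrics device without crossing the kernel; that kernel-bypass path is what the abstract credits for the low CPU utilization.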