A Taxonomy of Error Sources in HPC I/O Machine Learning Models

Autor:	Mihailo Isakov, Mikaela Currier, Eliakin del Rosario, Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert B. Ross, Glenn K. Lockwood, Michel A. Kinsy
Jazyk:	angličtina
Rok vydání:	2022
Předmět:	Performance (cs.PF) FOS: Computer and information sciences Computer Science - Performance Computer Science - Distributed Parallel and Cluster Computing Distributed Parallel and Cluster Computing (cs.DC)
Popis:	I/O efficiency is crucial to productivity in scientific computing, but the increasing complexity of the system and the applications makes it difficult for practitioners to understand and optimize I/O behavior at scale. Data-driven machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed. We analyze multiple years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application and system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::124d38a91a9303f0c7d5064b595c3dcc http://arxiv.org/abs/2204.08180 Zobrazit plný text záznamu