Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems
Authors: | Jeff Ogden, Sophia Lefantzi, Ann C. Gentile, Anthony Agelastos, Joel O. Stevenson, Jim Brandt, Benjamin A. Allan, Mahesh Rajan, Steve Monk |
---|---|
Publication year: | 2016 |
Subject: |
SIMPLE (military communications protocol); Computer Networks and Communications; Computer science; Real-time computing; 020206 networking & telecommunications; 02 engineering and technology; Computer Graphics and Computer-Aided Design; Theoretical Computer Science; Procurement; Artificial Intelligence; Hardware and Architecture; Scalability; 0202 electrical engineering, electronic engineering, information engineering; Systems engineering; Production (economics); 020201 artificial intelligence & image processing; Throughput (business); Software |
Source: | Parallel Computing. 58:90-106 |
ISSN: | 0167-8191 |
DOI: | 10.1016/j.parco.2016.05.009 |
Description: | Monitoring can provide meaningful system and application profiling in production. Visual and analytical characterizations can inform usage and procurement decisions. Resource utilization scoring provides simple but informative characterizations. Continuous, synchronous, high-fidelity, whole-system monitoring is required. A detailed understanding of HPC applications' resource needs, and of their complex interactions with each other and with HPC platform resources, is critical to achieving scalability and performance. Such understanding has been difficult to achieve for two reasons: typical application profiling tools do not capture the behavior of codes under the potentially wide spectrum of actual production conditions, and typical monitoring tools do not capture system resource usage with high enough fidelity to give sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized, system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions, for a better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads. |
Database: | OpenAIRE |
External link: |