Towards a Unified Monitoring Framework for Power, Performance and Thermal Metrics: A Case Study on the Evaluation of HPC Cooling Systems

Authors: Aniruddha Marathe, Barry Rountree, Ghaleb Abdulla, Kathleen Shoga
Year of publication: 2017
Subject:
Source: IPDPS Workshops
Description: We present a unifying approach to monitoring and analyzing various metrics crucial to understanding the operational characteristics of HPC systems at different levels. The increase in the performance of HPC-scale processors has been closely followed by an increase in the power draw of the processors and in the scale of HPC systems. Consequently, the relationship between the thermal and power characteristics of the system, from the processor level to the cluster level, is becoming more complex. Our monitoring framework brings together operational metrics collected by hardware and software monitoring components at the HPC cluster level and the subsystem component level to enable a comprehensive analysis of these characteristics. We show the effectiveness of our unified monitoring capability through a comparative study of the efficiency of traditional air cooling and a liquid-cooling retrofit on our large-scale HPC system. Using our unified monitoring framework, we are able to show, for the first time at our facility, that the liquid-cooled HPC system achieves significantly lower and more stable ambient temperatures in both temporal and spatial dimensions, lower temperature disparity across subsystem components, and better system power efficiency than the air-cooled HPC system.
Database: OpenAIRE