High Availability on Jetstream

Authors: George Turner, Sanjana Sudarshan, John Michael Lowe, Craig A. Stewart, David Y. Hancock, Jeremy Fischer
Publication year: 2018
Subject:
Source: ScienceCloud@HPDC
DOI: 10.1145/3217880.3217884
Description: Research computing has traditionally relied on high performance computing (HPC) clusters, a service that does not lend itself to high availability without doubling computational and storage capacity. System maintenance such as security patching, firmware updates, and other system upgrades generally meant that the system was unavailable for the duration of the work unless redundant HPC systems and storage were in place. While efforts were often made to limit downtime, when maintenance became necessary, windows might last one to two hours or as long as an entire day. As the National Science Foundation (NSF) began funding non-traditional research systems, finding ways to provide higher availability for researchers became a focus for service providers. One of the design elements of Jetstream was geographic dispersion to maximize availability. This was the first of a number of design elements intended to make Jetstream exceed the NSF's availability requirements. We examine the design steps employed, the components of the system and how availability was considered for each in deployment, how maintenance is handled, and the lessons learned from the design and implementation of the Jetstream cloud.
Database: OpenAIRE