High Availability on Jetstream
Authors: | George Turner, Sanjana Sudarshan, John Michael Lowe, Craig A. Stewart, David Y. Hancock, Jeremy Fischer |
---|---|
Year of publication: | 2018 |
Subject: | Service (systems architecture); Firmware; Computer science; Cloud computing; Service provider; Supercomputer; Engineering management; Software deployment; High availability; Duration (project management) |
Source: | ScienceCloud@HPDC |
DOI: | 10.1145/3217880.3217884 |
Description: | Research computing has traditionally relied on high performance computing (HPC) clusters, a service that could not offer high availability without doubling computational and storage capacity. System maintenance such as security patching, firmware updates, and other system upgrades generally meant that the system was unavailable for the duration of the work unless redundant HPC systems and storage were in place. While efforts were often made to limit downtime, necessary maintenance windows could last from one to two hours to as much as an entire day. As the National Science Foundation (NSF) began funding non-traditional research systems, providing higher availability for researchers became one focus for service providers. One of the design elements of Jetstream was geographic dispersion to maximize availability. This was the first of a number of design elements intended to make Jetstream exceed the NSF's availability requirements. We examine the design steps employed, the components of the system and how availability was considered for each during deployment, how maintenance is handled, and the lessons learned from the design and implementation of the Jetstream cloud. |
Database: | OpenAIRE |