High Availability on Jetstream

Authors:	George Turner, Sanjana Sudarshan, John Michael Lowe, Craig A. Stewart, David Y. Hancock, Jeremy Fischer
Publication year:	2018
Subject:	Service (systems architecture); Firmware; Computer science; business.industry; 05 social sciences; 050301 education; Cloud computing; Service provider; computer.software_genre; Supercomputer; 01 natural sciences; 010305 fluids & plasmas; Engineering management; Software deployment; High availability; 0103 physical sciences; Duration (project management); business; 0503 education; computer
Source:	ScienceCloud@HPDC
DOI:	10.1145/3217880.3217884
Description:	Research computing has traditionally used high performance computing (HPC) clusters and has been a service not given to high availability without a doubling of computational and storage capacity. System maintenance such as security patching, firmware updates, and other system upgrades generally meant that the system would be unavailable for the duration of the work unless one had redundant HPC systems and storage. While efforts were often made to limit downtime, when it became necessary, maintenance windows might last one to two hours or as much as an entire day. As the National Science Foundation (NSF) began funding non-traditional research systems, looking at ways to provide higher availability for researchers became one focus for service providers. One of the design elements of Jetstream was to have geographic dispersion to maximize availability. This was the first step in a number of design elements intended to make Jetstream exceed the NSF's availability requirements. We will examine the design steps employed, the components of the system and how the availability of each was considered in deployment, how maintenance is handled, and the lessons learned from the design and implementation of the Jetstream cloud.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::008e1a81f82c3b7a1690f97abc513c2a https://doi.org/10.1145/3217880.3217884