FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing
Authors: | Martin Küttler, Jan Bierbaum, Amnon Barak, Alexander Reinefeld, Ely Levy, Hermann Härtig, Wolfgang E. Nagel, Amnon Shiloh, Jan Fajerski, Thorsten Schütt, Thomas Steinke, Maksym Planeta, Matthias Lieber, Tal Ben-Nun, Carsten Weinhold, Adam Lackorzynski |
---|---|
Year of publication: | 2016 |
Subject: | 020203 distributed computing; Computer science; Busy waiting; Fault tolerance; 010103 numerical & computational mathematics; 02 engineering and technology; computer.software_genre; 01 natural sciences; Exascale computing; Gossip; Component (UML); 0202 electrical engineering, electronic engineering, information engineering; Systems architecture; Operating system; State (computer science); Microkernel; 0101 mathematics; computer |
Source: | Lecture Notes in Computational Science and Engineering; Software for Exascale Computing; ISBN: 9783319405261 |
DOI: | 10.1007/978-3-319-40528-5_18 |
Description: | In this paper we describe the hardware and application-inherent challenges that future exascale systems pose to high-performance computing (HPC) and propose a system architecture that addresses them. This architecture is based on proven building blocks and few principles: (1) a fast light-weight kernel that is supported by a virtualized Linux for tasks that are not performance critical, (2) decentralized load and health management using fault-tolerant gossip-based information dissemination, (3) a maximally-parallel checkpoint store for cheap checkpoint/restart in the presence of frequent component failures, and (4) a runtime that enables applications to interact with the underlying system platform through new interfaces. The paper discusses the vision behind FFMK and the current state of a prototype implementation of the system, which is based on a microkernel and an adapted MPI runtime. |
Database: | OpenAIRE |