High Performance Computing (HPC) in the Cloud: A Proactive Fault Tolerance (PFT) Strategy

Authors: Sharma, Sunil; Jain, Garima; D., Preethi; Bhardwaj, Shambhu
Language: English
Year of publication: 2023
Subject:
Source: International Journal of Intelligent Systems and Applications in Engineering; Vol. 11 No. 8s (2023): Advances on Machine Learning and Artificial Intelligence in Computer Technology; 71-78
ISSN: 2147-6799
Description: High Performance Computing (HPC) applications benefit from the new paradigms for computing, capacity, and flexible response that cloud computing provides. For instance, the Hardware as a Service (HaaS) paradigm lets users provision several Virtual Machines (VMs) for compute-intensive applications. Because an HPC system in the cloud uses many VMs and electrical components, any execution error forces applications to be re-run, which wastes time, money, and energy. In this research, we propose a Proactive Fault Tolerance (PFT) strategy for HPC systems in the cloud that addresses the wall-clock execution time and the cost incurred when faults occur. Additionally, we developed an enhanced PFT technique for cloud-based HPC systems; our approach does not depend on a spare node being available before a failure is predicted. We also built a cost model for running compute-intensive applications on cloud HPC servers. To evaluate the effectiveness of our strategy, we examined the monetary costs associated with provisioning spare nodes and with checkpointing under PFT. Our experimental findings from a real cloud execution environment show that the cost and execution time of computation-intensive applications in the cloud can be lowered by up to 30%. Compared with existing PFT approaches, our PFT technique for HPC in the cloud can reduce the frequency of checkpointing for computation-intensive applications by up to 50%.
Database: OpenAIRE