Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment.

Authors: Dong, Tingting; Xue, Fei; Tang, Hengliang; Xiao, Chuangbai
Source: Applied Intelligence; May 2023, Vol. 53 Issue 9, p9916-9932, 17p
Abstract: Cloud computing is widely used in various fields and can provide sufficient computing resources to address users' demands (workflows) quickly and effectively. However, resource failure is inevitable, and a key challenge in optimizing workflow scheduling is accounting for fault tolerance. Most previous algorithms rely on failure prediction or fixed fault-tolerant strategies, which can cause time delays and waste resources. In this paper, combining these two approaches through a deep reinforcement learning framework, an adaptive fault-tolerant workflow scheduling framework called RLFTWS is proposed, aiming to minimize both the makespan and the resource usage rate. In this framework, fault-tolerant workflow scheduling is formulated as a Markov decision process, with the resubmission and replication strategies serving as the two actions. A heuristic algorithm is designed for task allocation and execution according to the selected fault-tolerant strategy, and a double deep Q-network (DDQN) is developed to select the fault-tolerant strategy adaptively for each task under the current environment state, so that the framework not only predicts failures but also learns while interacting with the environment. Simulation results show that the proposed RLFTWS efficiently balances the makespan and resource usage rate and achieves fault tolerance. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
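
The abstract describes learning to choose between resubmission and replication per task. The following toy sketch (not the paper's implementation) illustrates the core idea with tabular double Q-learning instead of a DDQN, for a single abstract state and hypothetical rewards: replication pays a fixed resource cost up front, while resubmission pays a delay penalty only when a failure occurs. All reward values, the failure probability, and the function names are assumptions for illustration.

```python
import random

# 0 = resubmission, 1 = replication (the two actions in the abstract)
ACTIONS = (0, 1)

def reward(action, failed):
    # Hypothetical reward balancing makespan and resource usage:
    # replication pays a resource overhead regardless of failure;
    # resubmission pays a delay penalty only when the task fails.
    if action == 1:
        return -0.3                    # replica's resource cost
    return -1.0 if failed else 0.0     # resubmission delay on failure

def train(fail_prob, episodes=5000, alpha=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    qa = {a: 0.0 for a in ACTIONS}  # first estimator (online-net analogue)
    qb = {a: 0.0 for a in ACTIONS}  # second estimator (target-net analogue)
    for _ in range(episodes):
        # epsilon-greedy selection on the averaged estimate
        if rng.random() < eps:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: qa[x] + qb[x])
        r = reward(a, rng.random() < fail_prob)
        # double-Q decoupling: randomly pick which estimator to update
        # (the task is terminal here, so no bootstrapped next-state term)
        if rng.random() < 0.5:
            qa[a] += alpha * (r - qa[a])
        else:
            qb[a] += alpha * (r - qb[a])
    # learned policy: greedy on the averaged estimates
    return max(ACTIONS, key=lambda x: qa[x] + qb[x])
```

Under these assumed rewards, a high failure probability makes replication (action 1) the better choice, while a low one favors resubmission (action 0); the paper's DDQN plays the same role, but conditions the choice on a richer environment state for each workflow task.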