A novel ensemble-based paradigm to process large-scale data.

Authors: Trinh, Thanh; Le, HoangAnh; VuongThi, Nhung; HoangDuc, Hai; VuThi, KieuAnh
Source: Multimedia Tools & Applications; Mar 2024, Vol. 83, Issue 9, pp. 26663-26685, 23 pp.
Abstract: Big data analytics is an emerging topic in academic and industrial engineering, where handling large-scale data is the central challenge. It is crucial to design an effective large-scale data processing model to handle big data. In this paper, we aim to improve classification accuracy and reduce execution time for large-scale data on a small cluster. To address these challenges, this paper presents a novel ensemble-based paradigm that consists of splitting large-scale data files and building ensemble models. Two splitting methods are first developed to partition large-scale data into small, non-overlapping data blocks. We then propose two ensemble-based methods with high accuracy and lower execution time: a bagging-based method and a boosting-based method. The proposed paradigm can thus be implemented as four predictive models, each combining one of the two splitting methods with one of the two ensemble-based methods. A series of experiments was conducted to evaluate the effectiveness of the proposed paradigm with these four combinations. Overall, the paradigm with the boosting-based method achieves the best accuracy compared with existing methods. In addition, the boosting-based methods achieve 91.6% accuracy, compared with 52% for the baseline model, on a big data file with 10 million samples. However, the paradigm with the bagging-based method takes the least execution time to yield results. This paper also shows the effectiveness of a Spark computing cluster for large-scale data and points out the weakness of the RDD (Resilient Distributed Dataset). [ABSTRACT FROM AUTHOR]
Database: Complementary Index
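
Illustrative note: the split-then-ensemble idea summarized in the abstract can be sketched in a few lines. The abstract does not specify the paper's splitting methods, base learners, or Spark implementation, so everything below is an assumption made for illustration: a shuffled round-robin split stands in for the splitting step, a nearest-centroid classifier stands in for the base learner, and a bagging-style majority vote combines the per-block models. All function names (split_into_blocks, fit_centroids, predict_one, fit_ensemble, predict_ensemble) are hypothetical.

```python
import random
from collections import Counter


def split_into_blocks(n_samples, n_blocks, seed=0):
    """Partition indices 0..n_samples-1 into n_blocks disjoint blocks.

    Assumption: a shuffled round-robin split; the paper's two splitting
    methods are not described in the abstract.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_blocks] for i in range(n_blocks)]


def fit_centroids(X, y):
    """Train a stand-in base learner: one mean vector per class label."""
    sums, counts = {}, Counter()
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(centroids, features):
    """Return the label whose centroid is closest to the given point."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def fit_ensemble(X, y, n_blocks, seed=0):
    """Fit one base model per non-overlapping data block."""
    blocks = split_into_blocks(len(X), n_blocks, seed)
    return [fit_centroids([X[i] for i in block], [y[i] for i in block])
            for block in blocks]


def predict_ensemble(models, features):
    """Combine the per-block models by majority vote (bagging-style)."""
    votes = Counter(predict_one(model, features) for model in models)
    return votes.most_common(1)[0][0]
```

Because each block is trained independently, the per-block fits could in principle run in parallel on a cluster; a boosting-style variant would instead train the block models sequentially, which is not shown here.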