A Distributed Support Vector Machine Using Apache Spark for Semi-supervised Classification with Data Augmentation

Authors: Suresh Kumar Nagarajan, S.S. Blessy Trencia Lincy
Year of publication: 2019
Subject:
Source: Advances in Intelligent Systems and Computing ISBN: 9789811333927
DOI: 10.1007/978-981-13-3393-4_41
Description: The support vector machine (SVM) is one of the most popular and widely used classification algorithms in data mining and machine learning. However, SVMs have traditionally been applied to small or, at most, medium-sized datasets. The need to scale with growing dataset sizes has drawn research attention toward new SVM techniques and implementations that scale well to large datasets and tasks. Distributed SVMs have recently been studied, but data augmentation with semi-supervised classification using a distributed SVM has not yet been implemented. This paper introduces and analyzes a distributed implementation of the support vector machine with data augmentation on SparkR, a recent and effective platform for distributed computation. The framework, A Distributed Support Vector Machine under Apache Spark for Semi-supervised Classification with Smart Data Augmentation, is applied to a large-scale dataset of more than a million data points. The results and analysis show that the proposed approach greatly improves the method's performance in terms of execution time and processing speed.
Database: OpenAIRE