Random forest implementation and optimization for Big Data analytics on LexisNexis’s high performance computing cluster platform
Author: | Victor M. Herrera, Taghi M. Khoshgoftaar, Flavio Villanustre, Borko Furht |
---|---|
Language: | English |
Year of publication: | 2019 |
Subject: | Random forest; LexisNexis’s high performance computing cluster (HPCC) systems platform; Optimization for Big Data; Distributed machine learning; Turning recursion into iteration; Computer engineering. Computer hardware (TK7885-7895); Information technology (T58.5-58.64); Electronic computers. Computer science (QA75.5-76.95) |
Source: | Journal of Big Data, Vol 6, Iss 1, Pp 1-36 (2019) |
Document type: | article |
ISSN: | 2196-1115 |
DOI: | 10.1186/s40537-019-0232-1 |
Description: | Abstract: In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest’s learning process is based on the principle of recursive partitioning, and although recursion per se is not allowed in ECL (HPCC’s programming language), we were able to implement the recursive partition algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and describe all the modifications required to overcome the bottleneck within the iterative split/partition process, namely the gathering of data for the selected independent variables used in each node’s best-split analysis. Essentially, we describe how our initial Random Forest implementation was optimized into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform’s Big Data processing and analytics capabilities, we improved the data gathering method from an inefficient "Pass them All and Filter" approach to an effective and completely parallelized "Fetching on Demand" approach. Finally, based on the learning process runtime comparison between these two approaches, we confirm the speedup of our optimized Random Forest implementation. |
Database: | Directory of Open Access Journals |
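
The abstract's central idea, replacing recursive partitioning with an iterative split/partition loop, can be illustrated with a minimal Python sketch. This is not the authors' ECL implementation and does not model HPCC's distributed execution or the data gathering approaches they compare. It only shows one decision tree grown with an explicit work queue instead of recursive calls. All names here (`gini`, `best_split`, `build_tree_iteratively`, `predict`) are hypothetical, and the Gini impurity criterion is an assumption chosen for illustration. A full random forest would repeat this per tree on bootstrap samples with random feature subsets.

```python
from collections import Counter, deque


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, idxs):
    """Return (feature, threshold) that strictly lowers weighted Gini, else None."""
    best, best_score = None, gini([labels[i] for i in idxs])
    for f in range(len(rows[0])):
        for t in sorted({rows[i][f] for i in idxs}):
            left = [labels[i] for i in idxs if rows[i][f] <= t]
            right = [labels[i] for i in idxs if rows[i][f] > t]
            if not left or not right:
                continue  # a split must send rows to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(idxs)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree_iteratively(rows, labels, max_depth=5):
    """Grow one tree with a queue of pending nodes; no function calls itself."""
    nodes = {}  # node_id -> node description
    queue = deque([(0, list(range(len(rows))), 0)])  # (node_id, row indexes, depth)
    next_id = 1
    while queue:
        node_id, idxs, depth = queue.popleft()
        split = best_split(rows, labels, idxs) if depth < max_depth else None
        if split is None:
            majority = Counter(labels[i] for i in idxs).most_common(1)[0][0]
            nodes[node_id] = {"leaf": True, "label": majority}
            continue
        f, t = split
        left_id, right_id = next_id, next_id + 1
        next_id += 2
        nodes[node_id] = {"leaf": False, "feature": f, "threshold": t,
                          "left": left_id, "right": right_id}
        # Partition step: each child becomes a new work item for a later iteration.
        queue.append((left_id, [i for i in idxs if rows[i][f] <= t], depth + 1))
        queue.append((right_id, [i for i in idxs if rows[i][f] > t], depth + 1))
    return nodes


def predict(nodes, row):
    """Walk from the root (node 0) down to a leaf and return its label."""
    node = nodes[0]
    while not node["leaf"]:
        go_left = row[node["feature"]] <= node["threshold"]
        node = nodes[node["left"] if go_left else node["right"]]
    return node["label"]
```

Because the queue is processed first-in, first-out, nodes are handled level by level, so each pass over the queue corresponds roughly to one split/partition iteration over all nodes at the same depth.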