Random forest implementation and optimization for Big Data analytics on LexisNexis’s high performance computing cluster platform
Author: | Flavio Villanustre, Taghi M. Khoshgoftaar, Borko Furht, Victor M. Herrera |
Year of publication: | 2019 |
Subject: | Distributed machine learning; Random forest; Big data; Optimization for Big Data; Turning recursion into iteration; LexisNexis's high performance computing cluster (HPCC) Systems Platform; Partition problem; Speedup; Bottleneck; Supercomputer; Analytics; Computer engineering; Information Systems |
Source: | Journal of Big Data, Vol 6, Iss 1, Pp 1-36 (2019) |
ISSN: | 2196-1115 |
DOI: | 10.1186/s40537-019-0232-1 |
Description: | In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest's learning process is based on the principle of recursive partitioning, and although recursion per se is not allowed in ECL (HPCC's programming language), we were able to implement the recursive partitioning algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and thoroughly describe all the modifications required to overcome the bottleneck within the iterative split/partition process, i.e., optimizing how the selected independent variables used in each node's best-split analysis are gathered. Essentially, we describe how our initial Random Forest implementation was optimized into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform's Big Data processing and analytics capabilities, we enhanced the data gathering method from an inefficient Pass them All and Filter approach into an effective, fully parallelized Fetching on Demand approach. Finally, based on a runtime comparison of the learning process under these two approaches, we confirm the speedup of our optimized Random Forest implementation. |
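The abstract's central idea, expressing recursive partitioning as an iterative split/partition process, can be sketched in plain Python. This is only an illustrative assumption of the general technique, not the authors' ECL implementation: a FIFO frontier of nodes replaces the call stack that a recursive tree-growing routine would use, so each pass of the loop splits one pending node, much as a recursion-free language must process tree levels iteratively.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    rows: List[int]                     # sample indices reaching this node
    depth: int
    feature: Optional[int] = None       # split feature (None for leaves)
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[float] = None  # set only on leaves

def _sse(y, rows):
    """Sum of squared errors of y over the given row indices."""
    m = sum(y[i] for i in rows) / len(rows)
    return sum((y[i] - m) ** 2 for i in rows)

def build_tree_iterative(X, y, max_depth=3, min_size=2):
    """Grow a regression tree with NO recursion: a worklist (frontier)
    of unsplit nodes stands in for the recursive call stack."""
    root = Node(rows=list(range(len(y))), depth=0)
    frontier = [root]
    while frontier:                      # iterative split/partition loop
        node = frontier.pop(0)
        rows = node.rows
        if node.depth >= max_depth or len(rows) < min_size:
            node.prediction = sum(y[i] for i in rows) / len(rows)
            continue
        best = None  # (sse, feature, threshold, left_rows, right_rows)
        for f in range(len(X[0])):       # exhaustive best-split search
            for i in rows:
                t = X[i][f]
                lrows = [j for j in rows if X[j][f] <= t]
                rrows = [j for j in rows if X[j][f] > t]
                if not lrows or not rrows:
                    continue
                sse = _sse(y, lrows) + _sse(y, rrows)
                if best is None or sse < best[0]:
                    best = (sse, f, t, lrows, rrows)
        if best is None:                 # no valid split: make a leaf
            node.prediction = sum(y[i] for i in rows) / len(rows)
            continue
        _, node.feature, node.threshold, lrows, rrows = best
        node.left = Node(rows=lrows, depth=node.depth + 1)
        node.right = Node(rows=rrows, depth=node.depth + 1)
        frontier += [node.left, node.right]  # children queued, not recursed
    return root

def predict(node, x):
    """Descend iteratively from the root to a leaf."""
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction
```

A Random Forest would grow many such trees on bootstrap samples with random feature subsets; the sketch isolates only the recursion-to-iteration transformation, since that is the constraint the abstract attributes to ECL.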
Database: | OpenAIRE |
External link: |