Description: |
The large amount of stored information, combined with machine learning and statistical methods, enables high-quality insight into and analysis of data, which in turn leads to the design of high-precision predictive and classification systems. In this analysis, the selection of the most informative features is crucial for the quality of the resulting system. In this report, we propose two implementations of the multidimensional feature selection (MDFS) algorithm (Piliszek et al. in Mdfs-multidimensional feature selection. arXiv preprint. arXiv:1811.00631, 2018) that can be used in distributed environments to detect all relevant variables in data sets with a discrete decision variable. While most methods discard information about interactions between features, MDFS is designed to identify informative variables that are not relevant when considered alone but are relevant in groups. We developed the software using C++ and High Performance ParalleX (HPX) (Kaiser et al. in STEllAR-GROUP/hpx: HPX V1.3.0: the C++ Standards library for parallelism and concurrency. 2019. 10.5281/zenodo.3189323, 2019) to achieve high performance, scalability, and portability. HPX is a library that uses lightweight threads, asynchronous communication, and asynchronous task submission based on declarative criteria of work. These features enabled us to explore the granularity and parallelism of the MDFS algorithm in depth. The software is written entirely in C++; therefore, calculations can be performed using CPUs on desktops, distributed systems, and any system with C++ compiler support. In tests on a Cray XC40 (Okeanos) using artificially prepared data, we achieved a 196-fold speedup on 256 nodes compared to a single node. As a result, the ICM computing facility is now capable of massively parallel feature engineering.
The main purpose of the software is to enable researchers to analyse genomics data more accurately in the search for multiple correlations among potential sources of diseases.   |
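
The statement that variables may be irrelevant when considered alone but relevant in groups can be illustrated with a small sketch. The code below is not the MDFS implementation and does not use HPX; it is a minimal single-threaded example, assuming binary decision values and features already discretized to small integer keys. The function names `entropy` and `mutual_information` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy (in bits) of a discrete distribution given as raw counts.
double entropy(const std::vector<double>& counts) {
    double total = 0.0;
    for (double c : counts) total += c;
    double h = 0.0;
    for (double c : counts) {
        if (c > 0.0) {
            const double p = c / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Mutual information I(K; Y) = H(K) + H(Y) - H(K, Y), in bits.
// Assumptions (illustrative only): y holds values 0 or 1, and keys holds
// values in [0, n_keys). A key may encode one feature or a tuple of features.
double mutual_information(const std::vector<int>& keys,
                          const std::vector<int>& y,
                          int n_keys) {
    assert(keys.size() == y.size());
    std::vector<double> key_counts(n_keys, 0.0);
    std::vector<double> y_counts(2, 0.0);
    std::vector<double> joint_counts(static_cast<std::size_t>(n_keys) * 2, 0.0);
    for (std::size_t i = 0; i < keys.size(); ++i) {
        key_counts[keys[i]] += 1.0;
        y_counts[y[i]] += 1.0;
        joint_counts[static_cast<std::size_t>(keys[i]) * 2 + y[i]] += 1.0;
    }
    return entropy(key_counts) + entropy(y_counts) - entropy(joint_counts);
}
```

For an XOR-style decision, with x1 = {0, 0, 1, 1}, x2 = {0, 1, 0, 1} and y = {0, 1, 1, 0}, each feature alone carries 0 bits of information about y, while the pair encoded as the key 2*x1 + x2 carries 1 bit. A one-dimensional filter would discard both features, whereas a two-dimensional analysis would retain them.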