Synergistic effects between data corpora properties and machine learning performance in data pipelines

Autor: Bertolini, Roberto, Finch, Stephen J.
Zdroj: International Journal of Data Mining, Modelling and Management; 2022, Vol. 14 Issue: 3 p217-233, 17p
Abstrakt: To analyse data, a computationally feasible pipeline must be developed for data modelling. Corpora properties affect performance variability of machine learning (ML) techniques in pipelines; however, this has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n-small-p-corpora examining: 1) the choice of ML algorithm; 2) size of the training database; 3) measurement error; 4) class imbalance magnitude; 5) missing data pattern. Our simulations are consistent with established results under which these algorithms and corpora properties perform best, while providing insights into their synergistic effects. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance magnitudes, and missing data patterns. We discuss the implications of these findings for designing pipelines to enhance prediction performance.
Databáze: Supplemental Index