Stability of filter feature selection methods in data pipelines: a simulation study

Author: Bertolini, Roberto; Finch, Stephen J.
Source: International Journal of Data Science and Analytics; 20220101, Issue: Preprints p1-24, 24p
Abstract: Filter methods are a class of feature selection techniques used to identify a subset of informative features during data preprocessing. While the differential efficacy of these techniques has been extensively compared in data science pipelines for predictive outcome modeling, less work has examined how their stability is impacted by underlying corpora properties. A set of six stability metrics (Davis, Dice, Jaccard, Kappa, Lustgarten, and Novovičová) was compared during cross-validation in a Monte Carlo simulation study on synthetic data to examine variability in the stability of three filter methods in data pipelines for binary classification, considering five underlying data properties: (1) error of measurement in the independent covariates, (2) number of training observations, (3) number of features, (4) class imbalance magnitude, and (5) missing data pattern. Feature selection stability was platykurtic and was negatively impacted by measurement error and a smaller number of training observations included in the input corpora. The Novovičová stability metric yielded the highest mean stability values, while the Davis stability metric was the most unstable method. The distribution of all stability metrics was negatively skewed, and the Jaccard metric exhibited the largest amount of variability across all five data properties. A statistical analysis of the synergistic effects between filter feature selection techniques, filter cutoffs, data corpora properties, and machine learning (ML) algorithms on overall pipeline efficacy, quantified using the area under the curve (AUC) evaluation metric, is also presented and discussed.
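To illustrate the kind of stability metric compared in the study, the sketch below computes the mean pairwise Jaccard index over the feature subsets a filter method selects across cross-validation folds. This is a generic illustration of the Jaccard stability metric, not the authors' implementation; the fold selections shown are hypothetical.

```python
from itertools import combinations

def jaccard_stability(feature_sets):
    """Mean pairwise Jaccard index across feature subsets selected
    in different cross-validation folds: |A ∩ B| / |A ∪ B| averaged
    over all fold pairs. Returns 1.0 for identical subsets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        return 1.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        # Two empty subsets are treated as perfectly stable.
        total += len(a & b) / len(union) if union else 1.0
    return total / len(pairs)

# Hypothetical selections from three folds of a filter method
folds = [{"x1", "x2", "x3"}, {"x1", "x2", "x4"}, {"x1", "x3", "x4"}]
print(jaccard_stability(folds))  # → 0.5
```

A value near 1 indicates the filter method picks nearly the same features in every fold; values near 0 indicate unstable selections, the behavior the abstract associates with measurement error and small training samples.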
Database: Supplemental Index