Performance-based active learning for skewed data with nonparametric logistic regression

Authors: Wonjae Lee, Kangwon Seo
Year of publication: 2023
DOI: 10.21203/rs.3.rs-2953579/v1
Description: Real-world data often exhibit a skewed, long-tailed distribution in which some target classes have far fewer observations than others, rather than the ideal uniform distribution across categories. This imbalance substantially degrades model performance in classification problems. Parametric logistic regression is a fundamental classification model that is easy to interpret; however, it is doubtful that the logit is truly linear in the covariates. This research proposes a performance-based active learning (PbAL) scheme with nonparametric logistic regression to address class imbalance while allowing a nonlinear decision boundary. PbAL sequentially selects the most informative samples from an imbalanced dataset by directly evaluating a performance metric on a pool set. A nonparametric logistic regression model with smoothing splines provides a flexible classification boundary. Experiments show that PbAL outperforms traditional active learning approaches based on D-optimality and A-optimality. The proposed method also outperforms other resampling techniques used for imbalanced classification, such as Tomek links and SMOTE, even with a smaller sample size. These results suggest that PbAL effectively mitigates the bias that severely degrades model performance when only a small amount of initial training data is available.
Database: OpenAIRE
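
Note: The sequential selection idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the record does not confirm: a plain linear logistic regression fitted by gradient descent stands in for the smoothing-spline model, balanced accuracy is assumed as the performance metric, pool labels are assumed to be available for scoring, and one sample is added greedily per round. All function names (`fit_logistic`, `balanced_accuracy`, `pbal_select`) and the synthetic data are hypothetical.

```python
import math
import random


def fit_logistic(X, y, steps=200, lr=0.5):
    """Fit a linear logistic regression by gradient descent.

    Assumption: this is a simple stand-in for the nonparametric
    (smoothing-spline) model mentioned in the abstract.
    """
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the bias term
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, x):
    """Return class 1 if the linear score is positive, else class 0."""
    return 1 if w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)) > 0 else 0


def balanced_accuracy(w, X, y):
    """Mean of the per-class recalls (assumed metric, not from the source)."""
    hits, totals = {0: 0, 1: 0}, {0: 0, 1: 0}
    for xi, yi in zip(X, y):
        totals[yi] += 1
        hits[yi] += predict(w, xi) == yi
    return sum(hits[c] / totals[c] for c in (0, 1)) / 2


def pbal_select(X_pool, y_pool, labeled, n_queries):
    """Greedy performance-based selection (hypothetical sketch).

    Each round, every unlabeled candidate is tentatively added, the model
    is refit, and the candidate giving the best pool metric is kept.
    """
    labeled = list(labeled)
    for _ in range(n_queries):
        best_idx, best_score = None, -1.0
        for i in range(len(X_pool)):
            if i in labeled:
                continue
            trial = labeled + [i]
            w = fit_logistic([X_pool[k] for k in trial],
                             [y_pool[k] for k in trial])
            score = balanced_accuracy(w, X_pool, y_pool)
            if score > best_score:
                best_idx, best_score = i, score
        labeled.append(best_idx)
    return labeled


# Synthetic imbalanced pool: 54 majority (class 0) and 6 minority (class 1).
random.seed(0)
X_pool = [[random.gauss(-1.0, 1.0)] for _ in range(54)] + \
         [[random.gauss(2.0, 1.0)] for _ in range(6)]
y_pool = [0] * 54 + [1] * 6

initial = [0, 1, 54, 55]  # two labeled points from each class
selected = pbal_select(X_pool, y_pool, initial, n_queries=5)
```

Refitting the model for every candidate is expensive, so a practical version would likely restrict the candidate set or update the fit incrementally; the sketch favors clarity over speed.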