Comparisons of Classification Methods with Small Data

Author: Yu-Wen Kao, 高郁雯
Year of publication: 2018
Document type: 學位論文 (degree thesis) ; thesis
Description: 106
Background: To predict the occurrence of disease and classify it into symptom types from a large number of variables, researchers in recent years have often used statistical modeling or machine learning methods for binary outcomes, such as logistic regression and support vector machines (SVM). Logistic regression can quantify the associations between predictors and outcomes and, if the sampling is prospective, estimate the probability of the outcome event; the model is widely used in medical studies. Previous studies have shown that a small number of events may cause problematic parameter estimates in regression models. However, few studies have examined the effect of the number of events on outcome classification, and its impact on the performance of SVM has rarely been discussed. Objectives: The aim of this study was to investigate the effect of the proportion of events on classification performance for logistic regression and for SVM, respectively, when the data contain a large number of predictors. Methods: For logistic regression, we compared three variable selection methods: stepwise selection based on the Akaike information criterion (stepwise AIC), stepwise selection based on the Bayesian information criterion (stepwise BIC), and the Least Absolute Shrinkage and Selection Operator (LASSO). For SVM, we compared linear, radial basis function (RBF), and polynomial kernel functions. Because the data for the simulation studies were generated from a logistic model with pre-specified odds ratios, the present study did not compare the performance of logistic regression with that of SVM. The simulation scenarios included different training and test sizes, different parameter settings in the logistic regression, and different proportions of events at baseline.
Training and test sizes were 100, 200, 500 and 50, 100, respectively. Two parameter settings were considered: the relationship between the logit-transformed dependent variable and the independent variables (x) was either linear (setting I) or non-linear (setting II). The proportions of events at baseline were 20%, 50%, and 80%. We used the area under the ROC curve (AUC) to evaluate the performance of the classification methods and recorded the total model estimation time. Results and Conclusions: In terms of AUC, LASSO performed better among the logistic regression methods, whereas the three kernel functions differed little for SVM. In terms of computing time, LASSO was significantly faster than stepwise selection, and the linear-kernel SVM took less time. The AUCs of logistic regression and SVM were usually higher when the proportion of events at baseline was at the middle value (50%). Overall, classification performance was better in setting I (linear) than in setting II (non-linear).
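Illustrative note: the simulation design described above (data drawn from a logistic model with pre-specified odds ratios, a chosen proportion of events at baseline, and evaluation by AUC) can be sketched roughly as follows. This is a minimal sketch and not the thesis code. The function names `simulate_logistic` and `auc`, the standard normal predictors, the odds ratio values, and the reading of "baseline" as the event probability when all predictors equal zero are all assumptions made for illustration. Only setting I (linear on the logit scale) is shown, and the AUC is computed for the true linear predictor rather than for a fitted LASSO or SVM model.

```python
import math
import random


def simulate_logistic(n, odds_ratios, baseline_prob, seed=0):
    """Draw n observations from a logistic model (setting I, linear logit).

    odds_ratios:   hypothetical per-predictor odds ratios.
    baseline_prob: assumed event probability when every predictor is 0.
    Returns predictors X, binary outcomes y, and the true linear predictor.
    """
    rng = random.Random(seed)
    intercept = math.log(baseline_prob / (1.0 - baseline_prob))
    betas = [math.log(o) for o in odds_ratios]  # log odds ratios
    X, y, eta_all = [], [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in betas]  # assumed N(0, 1) predictors
        eta = intercept + sum(b * v for b, v in zip(betas, x))
        p = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
        X.append(x)
        y.append(1 if rng.random() < p else 0)
        eta_all.append(eta)
    return X, y, eta_all


def auc(y, scores):
    """Area under the ROC curve via the Mann-Whitney rank-sum statistic."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Give tied scores the average of the ranks they occupy.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both events and non-events")
    rank_sum = sum(r for r, label in zip(ranks, y) if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Usage: one training-size scenario at each baseline proportion in the abstract.
for p0 in (0.2, 0.5, 0.8):
    X, y, eta = simulate_logistic(500, [1.5, 2.0, 0.5], p0, seed=1)
    print(p0, sum(y) / len(y), round(auc(y, eta), 3))
```

In a fuller study, the true linear predictor passed to `auc` would be replaced by predicted scores from each fitted model on a separate test set, and the fitting time would be recorded alongside the AUC.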
Database: Networked Digital Library of Theses & Dissertations