Optimality of Training/Test Size and Resampling Effectiveness of Cross-Validation Estimators of the Generalization Error

Authors: Afendras, Georgios; Markatou, Marianthi
Year of publication: 2015
Subject:
Document type: Working Paper
Description: An important question in constructing cross-validation (CV) estimators of the generalization error is whether rules can be established that allow "optimal" selection of the size of the training set, for fixed sample size $n$. We define the resampling effectiveness of random CV estimators of the generalization error as the ratio of the limiting value of the variance of the CV estimator to the variance estimated from the data. Because the variance and the covariance of different average test set errors are independent of their indices, the resampling effectiveness depends only on the correlation and the number of repetitions used in the random CV estimator. We discuss statistical rules to define optimality and obtain the "optimal" training sample size as the solution of an appropriately formulated optimization problem. We show that, for a broad class of loss functions, the optimal training size equals half of the total sample size, independently of the data distribution. We optimally select the number of folds in $k$-fold cross-validation and offer a computational procedure for obtaining the optimal splitting in the case of classification (via logistic regression). We substantiate our claims both theoretically and empirically.
Comment: 53 pages, 6 figures, 16 tables
Database: arXiv
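
Note: The description states that the resampling effectiveness depends on the correlation and the number of repetitions. The sketch below illustrates one way this dependence can arise. It is not taken from the paper. It assumes that the random CV estimator is the plain average of J average test set errors sharing a common variance sigma2 and a common pairwise correlation rho, and it reads "the variance estimated from the data" as the variance of that average at finite J. The function names and the symbols J, sigma2, and rho are hypothetical.

```python
def variance_of_average(sigma2: float, rho: float, J: int) -> float:
    """Variance of the mean of J equicorrelated terms.

    Assumption: every term has variance sigma2 and every pair of
    distinct terms has correlation rho (no dependence on indices).
    """
    if J < 1:
        raise ValueError("J must be a positive integer")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    # Var(mean) = sigma2 / J + (J - 1) / J * rho * sigma2
    return sigma2 * (rho + (1.0 - rho) / J)


def resampling_effectiveness(rho: float, J: int) -> float:
    """Ratio of the limiting (J -> infinity) variance to the variance at J.

    Under the equicorrelation assumption the limit is rho * sigma2,
    so sigma2 cancels and only rho and J remain.
    """
    limiting = variance_of_average(1.0, rho, 1) * rho  # equals rho
    return limiting / variance_of_average(1.0, rho, J)


if __name__ == "__main__":
    # Print the ratio for a few repetition counts at a fixed correlation.
    for J in (1, 5, 20, 100):
        print(J, resampling_effectiveness(0.5, J))
```

Under these assumptions the ratio increases toward 1 as J grows, which is consistent with extra repetitions yielding diminishing variance reduction once the correlation term dominates.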