Validation of the linear regression method to evaluate population accuracy and bias of predictions for non-linear models

Autor: Haipeng Yu, Rohan L Fernando, Jack CM Dekkers
Rok vydání: 2022
DOI: 10.1101/2022.10.02.510518
Popis: BackgroundThe linear regression method (LR) was proposed to estimate population bias and accuracy of predictions, while addressing the limitations of commonly used cross-validation methods. The validity and behavior of the LR method have been provided and studied for linear model predictions but not for non-linear models. The objectives of this study were to 1) provide a mathematical proof for the validity of the LR method when predictions are based on conditional mean, 2) explore the behavior of the LR method in estimating bias and accuracy of predictions when the model fitted is different from the true model, and 3) provide guidelines on how to appropriately partition the data into training and validation such that the LR method can identify presence of bias and accuracy in predictions.ResultsWe present a mathematical proof for the validity of the LR method to estimate bias and accuracy of predictions based on the conditional mean, including for non-linear models. Using simulated data, we show that the LR method can accurately detect bias and estimate accuracy of predictions when an incorrect model is fitted when the data is partitioned such that the values of relevant predictor variables differ in the training and validation sets. But the LR method fails when the data are not partitioned in that manner.ConclusionsThe LR method was proven to be a valid method to evaluate the population bias and accuracy of predictions based on the conditional mean, regardless of whether it is a linear or non-linear function of the data. The ability of the LR method to detect bias and estimate accuracy of predictions when the model fitted is incorrect depends on how the data are partitioned. To appropriately test the predictive ability of a model using the LR method, the values of the relevant predictor variables need to be different between the training and validation sets.
Databáze: OpenAIRE