Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Authors: Mengke Qiao, Ke-Wei Huang
Year of publication: 2021
Subject:
Source: Information Systems Research. 32:462-480
ISSN: 1526-5536, 1047-7047
DOI: 10.1287/isre.2020.0977
Description: Social science studies increasingly apply data mining methods to construct variables for regression analysis. For example, text classification has been used to label each review as subjective or objective, and the derived review subjectivity then enters a regression as an independent variable to examine its impact on review helpfulness. In the classification phase of such studies, researchers must subjectively choose a classification performance metric to optimize. Whichever metric is chosen, the constructed variable still contains classification error, because no classifier labels every observation perfectly. This misclassification leads to inconsistent estimators of the regression coefficients in the subsequent regression phase. To correct the inconsistency, we summarize and modify existing proofs from econometrics to derive theoretical formulas for consistent estimators in generalized linear models. The main implication of our theoretical result is that the inconsistency can be corrected by these formulas even when classification accuracy is poor. We therefore propose that a classification algorithm be tuned to minimize the standard error of the focal coefficient derived from the corrected formula. As a result, researchers obtain a consistent and maximally precise estimator in generalized linear models.
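Illustration (not taken from the paper): the inconsistency described above can be seen in the simplest textbook case, a linear regression on a single misclassified binary regressor. The sketch below is a minimal simulation under stated assumptions, not the authors' generalized linear model formulas. It assumes that classification errors are independent of the outcome and that the false positive rate and false negative rate are known; in practice they would have to be estimated, for example from a labeled validation set. The function names (ols_slope, corrected_slope) and all parameter values are hypothetical. With true label X, observed label W, p = P(X=1), and q = P(W=1), the naive slope converges to the true slope multiplied by p(1-p)(1-fpr-fnr) / (q(1-q)), so dividing by that factor undoes the attenuation.

```python
import numpy as np


def ols_slope(x, y):
    # Slope from a simple OLS regression of y on x (with intercept).
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)


def corrected_slope(w, y, fpr, fnr):
    # w: observed (possibly misclassified) 0/1 regressor.
    # fpr = P(W=1 | X=0), fnr = P(W=0 | X=1); assumed known here.
    q = w.mean()                          # observed share of positives
    p = (q - fpr) / (1.0 - fpr - fnr)     # implied true share of positives
    attenuation = p * (1 - p) * (1 - fpr - fnr) / (q * (1 - q))
    return ols_slope(w, y) / attenuation


# Hypothetical simulation settings, chosen only for illustration.
rng = np.random.default_rng(0)
n = 200_000
beta0, beta1 = 1.0, 2.0
fpr, fnr = 0.1, 0.2

x = rng.random(n) < 0.3                   # true binary variable
# Flip true positives with probability fnr, true negatives with probability fpr.
flip = np.where(x, rng.random(n) < fnr, rng.random(n) < fpr)
w = np.where(flip, ~x, x).astype(float)   # classifier output
y = beta0 + beta1 * x + rng.normal(size=n)

naive = ols_slope(w, y)                   # attenuated toward zero
corrected = corrected_slope(w, y, fpr, fnr)
print(f"true slope {beta1:.2f}, naive {naive:.3f}, corrected {corrected:.3f}")
```

In this setup the naive slope is biased toward zero, while the corrected slope recovers the true coefficient up to sampling noise. The correction inflates the standard error by the same factor, which is consistent with the abstract's point that the classifier is better tuned for the precision of the corrected estimate than for a generic accuracy metric.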
Database: OpenAIRE