Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Author:	Mengke Qiao, Ke-Wei Huang
Year of publication:	2021
Subject:	Information Systems and Management; Observational error; Computer Networks and Communications; Computer science; 05 social sciences; Regression analysis; 02 engineering and technology; Library and Information Sciences; Management Information Systems; 020204 information systems; 0502 economics and business; 0202 electrical engineering, electronic engineering, information engineering; Statistical inference; 050211 marketing; Data mining; Information bias; Construct (philosophy); Performance metric; Information Systems
Source:	Information Systems Research. 32:462-480
ISSN:	1526-5536 1047-7047
DOI:	10.1287/isre.2020.0977
Description:	There is a surge of interest in social science studies in applying data mining methods to construct variables for regression analysis. For example, text classification was applied to classify whether the review is subjective or objective. The derived review subjectivity was used as an independent variable in the regression to examine its impact on review helpfulness. In the classification phase of these studies, researchers need to subjectively choose a classification performance metric for optimization. No matter which performance metric is chosen, the constructed variable still includes classification error because the variable cannot be classified perfectly. The misclassification of constructed variables will lead to inconsistent estimators of regression coefficients in the following phase. To correct the estimation inconsistency, we summarize and modify existing proofs in econometrics to derive theoretical formulas of consistent estimators in generalized linear models. The main implication of our theoretical result is that the inconsistency can be corrected by theoretical formulas, even when the classification accuracy is poor. Therefore, we propose that a classification algorithm should be tuned to minimize the standard error of the focal coefficient derived based on the corrected formula. As a result, researchers derive a consistent and most precise estimator in generalized linear models.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f6bda0b422d4dad1a2eb9f82a637bfbb https://doi.org/10.1287/isre.2020.0977
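
Illustration:	The inconsistency described in the abstract can be seen in a small simulation. The sketch below is not the authors' method and does not reproduce their formulas for generalized linear models. It uses the textbook special case of a simple linear regression with one misclassified binary regressor, and it assumes that the false positive rate (`fpr`) and false negative rate (`fnr`) of the classifier are known, for instance from a labeled validation set, and that misclassification is independent of the regression error. All function names and parameter values are hypothetical.

```python
import numpy as np


def simulate(n=200_000, beta=2.0, p=0.4, fpr=0.1, fnr=0.2, seed=0):
    """Draw y = 1 + beta * x + noise and a misclassified copy w of binary x."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < p  # true (unobserved) class
    # Flip true positives with probability fnr, true negatives with fpr.
    flip = np.where(x, rng.random(n) < fnr, rng.random(n) < fpr)
    w = np.where(flip, ~x, x)  # classifier output seen by the researcher
    y = 1.0 + beta * x + rng.normal(size=n)
    return y, w.astype(float)


def naive_slope(y, w):
    """OLS slope of y on the misclassified regressor w."""
    return np.cov(y, w)[0, 1] / np.var(w, ddof=1)


def corrected_slope(y, w, fpr, fnr):
    """Undo the attenuation using known misclassification rates.

    Under the stated assumptions, with q = P(w = 1) and p = P(x = 1):
        naive = beta * p * (1 - p) * (1 - fpr - fnr) / (q * (1 - q))
        p     = (q - fpr) / (1 - fpr - fnr)
    """
    q = w.mean()
    k = 1.0 - fpr - fnr
    p = (q - fpr) / k
    return naive_slope(y, w) * q * (1 - q) / (p * (1 - p) * k)


if __name__ == "__main__":
    y, w = simulate()
    print("naive slope:    ", naive_slope(y, w))
    print("corrected slope:", corrected_slope(y, w, fpr=0.1, fnr=0.2))
```

With these settings the naive slope falls well below the true coefficient of 2.0, while the corrected slope should land close to it. The correction only requires `fpr + fnr < 1`, which is consistent with the abstract's point that the inconsistency can be removed even when classification accuracy is poor.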