Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems
Authors: | Carlos Mougan, Oriol Pujol, David Masip, Jordi Nin |
---|---|
Publication year: | 2021 |
Subject: |
Computer science
business.industry; Regression analysis; 02 engineering and technology; Overfitting; Machine learning; computer.software_genre; ComputingMethodologies_PATTERNRECOGNITION; Cardinality; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Feature (machine learning); 020201 artificial intelligence & image processing; Artificial intelligence; Additive smoothing; business; Encoder; Categorical variable; computer; Quantile |
Source: | Modeling Decisions for Artificial Intelligence (MDAI), Lecture Notes in Computer Science. ISBN: 9783030855284 |
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-030-85529-1_14 |
Description: | Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to the problem of incorporating categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model. |
Database: | OpenAIRE |
External link: |
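The description mentions three ideas: encoding each category by a quantile of the target, additive smoothing for categories with small support, and expanding the encoding into several quantiles. The following is a minimal sketch of those ideas, not the authors' implementation. It assumes that smoothing is a weighted average of the per-category quantile and the global quantile, weighted by the category count and a hypothetical strength parameter `m`. All function names, parameter names, and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def quantile(values, p):
    """Quantile of `values` at level p in [0, 1], using linear interpolation."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def fit_quantile_encoder(categories, targets, p=0.5, m=1.0):
    """Learn a smoothed p-quantile of the target for each category.

    Assumed smoothing: (n * q_category + m * q_global) / (n + m),
    where n is the number of training rows in the category.
    """
    groups = defaultdict(list)
    for c, y in zip(categories, targets):
        groups[c].append(y)
    global_q = quantile(targets, p)
    mapping = {
        c: (len(ys) * quantile(ys, p) + m * global_q) / (len(ys) + m)
        for c, ys in groups.items()
    }
    return mapping, global_q


def encode(categories, mapping, global_q):
    """Replace each category by its learned value; unseen ones get the global quantile."""
    return [mapping.get(c, global_q) for c in categories]


def fit_expanded_encoder(categories, targets, ps=(0.25, 0.5, 0.75), m=1.0):
    """Fit one encoder per quantile level, giving one feature per level."""
    return {p: fit_quantile_encoder(categories, targets, p, m) for p in ps}


# Toy data (made up for illustration): category "c" has a single observation.
cities = ["a", "a", "a", "b", "b", "c"]
prices = [10.0, 12.0, 50.0, 20.0, 22.0, 100.0]

mapping, global_q = fit_quantile_encoder(cities, prices, p=0.5, m=1.0)
encoded = encode(["a", "c", "unseen"], mapping, global_q)
expanded = fit_expanded_encoder(cities, prices)
```

In this toy data, the single-observation category `"c"` is pulled toward the global median, while category `"a"` stays close to its own median despite the outlier value 50.0, which is the robustness to skewed targets that the description attributes to quantile-based encoding. To avoid target leakage, the mapping should be fitted on training data only.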