Author: |
Hristov, Hristo, Momcheva, Galina, Pasheva, Vesela, Popivanov, Nedyu, Venkov, George |
Subject: |
|
Source: |
AIP Conference Proceedings; 2020, Vol. 2333 Issue 1, p1-9, 9p |
Abstract: |
Regression is a powerful technique for predicting a single scalar value with a high degree of certainty. A regression model requires a dataset that consists of only numeric features. However, datasets frequently contain both numeric and categorical features. This paper sets out to study several text encoding techniques that solve this problem by transforming a given text value into a corresponding numerical value that is statistically sound in relation to the dataset. The scenario we study is a neural network regression model predicting support case durations. Many of the dataset's features are indeed categorical, such as issue description, team, username, severity, and impact, among others. The most challenging aspect of the encoding is the resulting change in the dimensionality of the dataset. Every encoding method affects the dimensionality to a different degree depending on the feature cardinality. This change is the main challenge for tuning the neural network hyperparameters: we must devise a setup that can robustly handle the altered dataset. The paper compares five approaches to encoding: one-hot, hashing, binary, target, and entity embeddings. The hyperparameter settings for each approach are presented using common neural network performance metrics and a baseline neural network setup. We conclude that a moderately increased dimensionality can enhance the model's predictive power, as observed in the case of the binary and the hashing encoders. [ABSTRACT FROM AUTHOR] |
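Editor's note: the following is a minimal illustrative sketch, not the authors' code. It contrasts how one-hot and hashing encodings of a single categorical feature change the dataset's dimensionality, which the abstract identifies as the main challenge. The column name "team", the toy values, the hash size of 3, and the use of pandas and scikit-learn are all assumptions for illustration only.

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# A toy categorical feature with cardinality 4 (hypothetical values).
df = pd.DataFrame({"team": ["backend", "frontend", "db", "backend", "ops"]})

# One-hot encoding: one new column per distinct category,
# so dimensionality grows with the feature's cardinality.
one_hot = pd.get_dummies(df["team"], prefix="team")
print(one_hot.shape)  # (5, 4)

# Hashing encoding: a fixed number of output columns chosen up front,
# independent of cardinality, at the cost of possible collisions.
hasher = FeatureHasher(n_features=3, input_type="string")
hashed = hasher.transform([[value] for value in df["team"]]).toarray()
print(hashed.shape)  # (5, 3)

With a fixed hash size, the hashing encoder keeps the input dimensionality of the downstream regression network constant regardless of how many distinct categories appear, whereas one-hot (and, to a lesser extent, binary) encoding expands it with cardinality.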
Database: |
Complementary Index |
External link: |
|