A Two-Stage Machine Learning Approach for Temporally-Robust Text Classification.

Authors: Salles, Thiago (tsalles@dcc.ufmg.br); Rocha, Leonardo (lcrocha@ufsj.edu.br); Mourão, Fernando (fhmourao@ufsj.edu.br); Gonçalves, Marcos (mgoncalv@dcc.ufmg.br); Viegas, Felipe (frviegas@dcc.ufmg.br); Meira Jr., Wagner (meira@dcc.ufmg.br)
Source: Information Systems. Sep 2017, Vol. 69, pp. 40-58 (19 pages).
Abstract: One of the most relevant research topics in Information Retrieval is Automatic Document Classification (ADC). Several ADC algorithms have been proposed in the literature. However, the majority of these algorithms assume that the underlying data distribution does not change over time. Previous work has demonstrated the negative impact of three main temporal effects in representative textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes [1]. To minimize the impact of temporal effects on ADC algorithms, we have previously introduced the notion of a temporal weighting function (TWF), which reflects the varying nature of textual datasets. We have also proposed a procedure to derive the TWF's expression and parameters. However, deriving the TWF requires running explicit and complex statistical tests, which are cumbersome and, in several cases, cannot be run at all. In this article, we propose a machine learning methodology to automatically learn the TWF without the need to perform any statistical tests. We also propose new strategies to incorporate the TWF into ADC algorithms, which we call temporally-aware classifiers. Experiments showed that the fully-automated temporally-aware classifiers achieved significant gains (up to 17%) when compared to their non-temporal counterparts, even outperforming some state-of-the-art algorithms (e.g., SVM) in most cases, with large reductions in execution time. [ABSTRACT FROM AUTHOR]
Database: Library, Information Science & Technology Abstracts