Evaluating Mutual Information and Chi-Square Metrics in Text Features Selection Process: A Study Case Applied to the Text Classification in PubMed
Author: | Alfredo Simón-Cuevas, Christian Torres-Morán, Fernando Rojas, Rodolfo García-Bermúdez, José Párraga-Valle |
---|---|
Publication year: | 2020 |
Subject: | |
Source: | Bioinformatics and Biomedical Engineering ISBN: 9783030453848 IWBBIO |
DOI: | 10.1007/978-3-030-45385-5_57 |
Description: | The aim of this work was to compare the behavior of mutual information and Chi-square as metrics for evaluating the relevance of terms extracted from documents related to “software design” retrieved from the PubMed database. The metrics were tested in two contexts: using the set of terms obtained by vectorizing the corpus of abstracts, and using only the terms from the vocabulary defined by the ISO/IEC/IEEE 24765 standard. A search was conducted on the subject “software” over the last 6 years, and the Medical Subject Headings (MeSH) term “software design” of the articles was used to label them. Mutual information and Chi-square were then computed to rank and select features. Chi-square obtained the highest accuracy scores in document classification using a multinomial naive Bayes classifier. Although these results suggest that Chi-square is better than mutual information for feature relevance estimation in the context of this work, further research is necessary to establish a consistent foundation for this conclusion. |
Database: | OpenAIRE |
External link: |
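
The feature-scoring step mentioned in the description can be illustrated with a minimal sketch. It assumes the common binary formulation, in which each term is scored from a 2x2 table of document counts (term present or absent, label positive or negative). The record does not state the exact formulas or tooling the authors used, so the function names, the toy corpus, and the labels below are all hypothetical, and the classification step with multinomial naive Bayes is not covered.

```python
import math


def contingency(docs, labels, term):
    """Count documents by (term present?, label positive?)."""
    n11 = n10 = n01 = n00 = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label:
            n11 += 1
        elif present:
            n10 += 1
        elif label:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # a marginal is empty, so the term carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class."""
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # empty cells contribute nothing
            mi += joint / n * math.log2(n * joint / (term_marginal * class_marginal))
    return mi


def rank_terms(docs, labels, score):
    """Return (term, score) pairs sorted from most to least relevant."""
    vocabulary = sorted(set().union(*docs))
    scored = [(t, score(*contingency(docs, labels, t))) for t in vocabulary]
    return sorted(scored, key=lambda pair: -pair[1])


# Hypothetical toy corpus: each document is a set of tokens, and the
# label marks whether it would carry the "software design" heading.
docs = [
    {"software", "design", "architecture"},
    {"design", "pattern", "module"},
    {"software", "genome", "alignment"},
    {"software", "protein", "database"},
]
labels = [True, True, False, False]

chi2_ranking = rank_terms(docs, labels, chi_square)
mi_ranking = rank_terms(docs, labels, mutual_information)
```

In this toy corpus the term "design" separates the two classes perfectly, so both metrics rank it first. On real data the two rankings can diverge, which is the kind of difference the study compares.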