Description: |
This work addresses the text document classification problem derived from the contest “Identify Keywords and Tags from Millions of Text Questions”, published on the Kaggle website. Using data from the StackOverflow website, the task is to predict the tags assigned to questions. The categorization is multi-class and multi-tag: a question can be assigned to different topics and can also carry several tags. To solve this problem, we propose a 5-way multi-class classifier system. We discuss the results of this classification scheme by analysing certain score metrics of the classifier system. The 5-way classifier system achieved competitive results, with F1 scores ranging from 0.59 to 0.76. The main contribution of this paper lies in the preprocessing (which implements the feature extraction phase) and in the multi-tag multi-class classification scheme. |