Popis: |
In this paper, we build a new dataset, UIT-ViON (Vietnamese Online Newspaper), collected from well-known Vietnamese online newspapers. We collect and process the data to create the dataset, then experiment with different machine learning models. In particular, we present an open-domain, large-scale, and high-quality dataset consisting of 260,000 textual data points annotated with multiple labels for evaluating Vietnamese short text classification. In addition, we propose a transformer-based approach (PhoBERT) for Vietnamese short text classification on the dataset, which outperforms traditional machine learning models (Naive Bayes and Logistic Regression) and deep learning models (Text-CNN and LSTM). The proposed approach achieves an F1-score of 80.62%. This is a promising result and a foundation for developing an automatic news classification system. The study aims to save significant time, cost, and human resources and to make it easier for readers to find news on topics that interest them. In future work, we plan to improve the quality of the dataset and the performance of the classification models. |