An Empirical Investigation of Online News Classification on an Open-Domain, Large-Scale and High-Quality Dataset in Vietnamese

Author:	Phap Ngoc Trinh, Khoa Nguyen-Anh Tran, Luan Van Ha, An Tran-Hoai Le, Khanh Quoc Tran, Kiet Van Nguyen
Year of publication:	2021
Subject:	Scale (ratio); Computer science; Vietnamese; Open domain; Quality (business); Data science
Source:	SoMeT
Description:	In this paper, we build a new dataset, UIT-ViON (Vietnamese Online Newspaper), collected from well-known Vietnamese online newspapers. We collect, process, and create the dataset, then experiment with different machine learning models. In particular, we propose an open-domain, large-scale, and high-quality dataset consisting of 260,000 textual data points annotated with multiple labels for evaluating Vietnamese short text classification. In addition, we present an approach using transformer-based learning (PhoBERT) for Vietnamese short text classification on the dataset, which outperforms traditional machine learning (Naive Bayes and Logistic Regression) and deep learning (Text-CNN and LSTM) models. The proposed approach achieves an F1-score of 80.62%. This is a positive result and a basis for developing an automatic news classification system. The study aims to save significant time, cost, and human resources and to make it easier for readers to find news on topics that interest them. In the future, we will propose solutions to improve the quality of the dataset and the performance of the classification models.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::7eeac8041406caa47a080009ef6ded9b https://doi.org/10.3233/faia210036