Breaking news: Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches.

Authors: Klaifer Garcia, Pedro Shiguihara, Lilian Berton
Language: English
Publication year: 2024
Subject:
Source: PLoS ONE, Vol 19, Iss 1, p e0296929 (2024)
Document type: article
ISSN: 1932-6203
DOI: 10.1371/journal.pone.0296929
Description: Every day, thousands of news articles are published on the web, and filtering tools can be used to extract knowledge on specific topics. The categorization of news into a predefined set of topics is widely studied in the literature; however, most work is restricted to documents in English. In this work, we make two contributions. First, we introduce a Portuguese news dataset collected from WikiNews, an open-source media outlet that provides news from different sources. Because datasets for Portuguese are scarce, and an existing one comes from a single news channel, we aim to introduce a dataset that draws on different news channels. The availability of comprehensive datasets plays a key role in advancing research. Second, we compare different architectures for Portuguese news classification, exploring different text representations (BoW, TF-IDF, Embedding) and classification techniques (SVM, CNN, DJINN, BERT), covering both classical methods and current technologies. We show the trade-off between accuracy and training time for this application. We aim to show the capabilities of the available algorithms and the challenges faced in the area.
Database: Directory of Open Access Journals