Roman Urdu toxic comment classification
Authors: | Asim Karim, Faisal Kamiran, Hafiz Hassaan Saeed, Toon Calders, Muhammad Haseeb Ashraf |
---|---|
Year of publication: | 2021 |
Subject: |
Natural language processing; Computational linguistics; Artificial intelligence; Computer science; Linguistics and Language; Language and Linguistics; Urdu; South Asia; Social media; Word2vec; Representation (arts); Identification (information); Agreement; Popularity; Library and Information Sciences; Education |
Source: | Language Resources and Evaluation |
ISSN: | 1574-0218; 1574-020X |
DOI: | 10.1007/s10579-021-09530-y |
Description: | With the increasing popularity of user-generated content on social media, the number of toxic texts is also on the rise. Such texts harm users and society at large; the identification of toxic comments is therefore a pressing need. While toxic comment classification has been studied for resource-rich languages like English, no work has been done for Roman Urdu, even though it is widely used on social media in South Asia. This paper addresses the challenge of Roman Urdu toxic comment detection by developing the first large labeled corpus of toxic and non-toxic comments. The corpus, called RUT (Roman Urdu Toxic), contains over 72,000 comments collected from popular social media platforms, labeled manually with strong inter-annotator agreement. With this dataset, we train several classification models to detect Roman Urdu toxic comments, including classical machine learning models with the bag-of-words representation and recent deep models based on word embeddings. Despite the success of the latter in classifying toxic comments in English, the absence of pre-trained word embeddings for Roman Urdu prompted us to generate word embeddings with the GloVe, Word2Vec, and FastText techniques and to compare them with task-specific word embeddings learned inside the classification task. Finally, we propose an ensemble approach that reaches our best F1-score of 86.35%, setting the first benchmark for toxic comment classification in Roman Urdu. |
Database: | OpenAIRE |
External link: |
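
The description above outlines a pipeline of classical bag-of-words models, embedding-based deep models, and a final ensemble. As a minimal sketch of the classical branch, the Python snippet below trains a bag-of-words soft-voting ensemble on a few hypothetical Roman Urdu comments; the placeholder data, the choice of logistic regression and naive Bayes, and all hyperparameters are assumptions for illustration, not the authors' actual configuration.

```python
# Toy sketch of a bag-of-words + soft-voting ensemble for toxic comment
# classification. All data and model choices below are illustrative
# assumptions; the RUT corpus and the paper's exact setup are not shown here.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for RUT: Roman Urdu comments, 0 = non-toxic, 1 = toxic.
comments = [
    "kya zabardast video hai",
    "bohat khoob kaam kiya",
    "yeh gaana dil ko chhoo gaya",
    "shukriya share karne ka",
    "tum bilkul bekar ho",
    "kitni ghatiya baat hai",
    "sharam karo aisi harkat par",
    "yeh sab bakwas hai",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, random_state=42, stratify=labels
)

# Bag-of-words features feed a soft-voting ensemble of two classical models;
# soft voting averages the per-class probabilities of the base classifiers.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="soft",
    ),
)

model.fit(X_train, y_train)
print("F1:", f1_score(y_test, model.predict(X_test)))
```

For the embedding branch, since the record notes that no pre-trained Roman Urdu embeddings were available, Word2Vec or FastText vectors could likewise be trained from scratch on the corpus (for example with gensim) before feeding a deep classifier.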