Abusive Comments Detection in Bangla-English Code-mixed and Transliterated Text

Authors: Istiak Ahamed, Swakkhar Shatabda, Maliha Jahan, Md. Rayanuzzaman Bishwas
Publication year: 2019
Subject:
Source: 2019 2nd International Conference on Innovation in Engineering and Technology (ICIET).
DOI: 10.1109/iciet48527.2019.9290630
Description: The comment section on public websites, while reflecting public opinion and enabling people to offer constructive criticism or show appreciation, is treated by some people as a stage for using vulgar and offensive words without any consequences. With the rising popularity of micro-blogging websites such as Facebook and Twitter, Bangla speakers’ tendency to use code-mixing and transliteration is increasing as well. Manually checking and removing abusive comments from public websites is tedious, which is undesirable in the present day of technological automation. In this paper, we propose a method to detect abusive comments using Machine Learning algorithms. The method works not only with Bangla text but also with Bangla-English code-mixed text and transliterated Bangla text. It involves a great amount of preprocessing, because people often disregard correct spelling, grammar and punctuation when writing comments on the internet. For the dataset, we collected comments from public Facebook pages along with the number of likes they received. For features, we used unigrams, bigrams, the number of likes, emojis along with their categories, sentiment scores, offensive and threatening words used in the comments (detected using our proposed algorithm), and the number of abusive words in each comment. The proposed algorithm can also detect profanity types. After experimenting with three Machine Learning algorithms, namely Support Vector Machine, Random Forest, and AdaBoost, the proposed method achieved a best accuracy of 72.14%.
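As a rough illustration of part of the feature set listed in the description, the sketch below builds unigram counts, bigram counts, the like count, and an abusive-word count for one comment. It is not the authors' implementation: the function names, the feature-key format, and the one-word placeholder lexicon `ABUSIVE_WORDS` are all assumptions made for this example, and the emoji, sentiment-score, and classifier stages are left out.

```python
import string
from collections import Counter

# Hypothetical placeholder lexicon; the paper's actual word list is not reproduced here.
ABUSIVE_WORDS = {"badword"}

# ASCII punctuation plus the Bangla danda, stripped from token edges.
_EDGE_CHARS = string.punctuation + "\u0964"


def tokenize(comment):
    """Lowercase, split on whitespace, and strip punctuation from token edges."""
    tokens = (tok.strip(_EDGE_CHARS) for tok in comment.lower().split())
    return [tok for tok in tokens if tok]


def extract_features(comment, likes, lexicon=ABUSIVE_WORDS):
    """Return a sparse feature dict for one comment (assumed key format)."""
    tokens = tokenize(comment)
    features = Counter()
    for tok in tokens:                       # unigram counts
        features["uni=" + tok] += 1
    for a, b in zip(tokens, tokens[1:]):     # bigram counts
        features["bi=" + a + " " + b] += 1
    features["likes"] = likes                # number of likes on the comment
    features["abusive_count"] = sum(1 for tok in tokens if tok in lexicon)
    return dict(features)


if __name__ == "__main__":
    print(extract_features("Tui ekta BADWORD, tui!", likes=3))
```

A dictionary of this shape could then be vectorized and passed to any of the three classifiers named in the description; that step is omitted here because the source does not describe how the features were encoded.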
Database: OpenAIRE