Abusive Comments Detection in Bangla-English Code-mixed and Transliterated Text
Authors: | Istiak Ahamed, Swakkhar Shatabda, Maliha Jahan, Md. Rayanuzzaman Bishwas |
---|---|
Publication year: | 2019 |
Subject: | Grammar; Computer science; Bigram; Social sciences; Communication & media studies; Punctuation; Popularity; Medical and health sciences; Clinical medicine; Media and communications; Bengali; Pediatrics; Transliteration; The Internet; Artificial intelligence; AdaBoost; Natural language processing |
Source: | 2019 2nd International Conference on Innovation in Engineering and Technology (ICIET). |
DOI: | 10.1109/iciet48527.2019.9290630 |
Description: | Comment sections on public websites reflect public opinion and let people offer constructive criticism or show appreciation, but some people treat them as a stage for vulgar and offensive language without any consequences. With the rising popularity of micro-blogging websites such as Facebook and Twitter, Bangla language speakers’ tendency to use code-mixing and transliteration is increasing as well. Manually checking and removing abusive comments from public websites is tedious, which is undesirable in an era of technological automation. In this paper, we propose a method to detect abusive comments using machine learning algorithms. The method works not only with Bangla text but also with Bangla-English code-mixed text and transliterated Bangla text. It involves a great amount of preprocessing because people often disregard correct spelling, grammar, and punctuation when writing comments on the internet. For the dataset, we collected comments from public Facebook pages along with the number of likes each received. As features, we used unigrams, bigrams, the number of likes, emojis along with their categories, sentiment scores, offensive and threatening words used in the comments (detected using our proposed algorithm), and the number of abusive words in each comment. The aforementioned algorithm can detect profanitypes too. After experiments with three machine learning algorithms, namely Support Vector Machine, Random Forest, and AdaBoost, the proposed method achieved a highest accuracy of 72.14%. |
Database: | OpenAIRE |
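
The description lists unigrams, bigrams, the number of likes, and a count of abusive words among the features. A minimal sketch of how such a feature dictionary might be assembled is shown below, using only the Python standard library. It is not the authors' implementation: the names `tokenize`, `extract_features`, and `ABUSIVE_LEXICON`, the feature-key prefixes, and the placeholder lexicon entries are all assumptions made for illustration, and the emoji, sentiment, and preprocessing steps from the description are left out.

```python
import string
from collections import Counter

# Hypothetical placeholder lexicon; the word list used in the paper is not
# given in this record.
ABUSIVE_LEXICON = {"badword", "slurword"}

# ASCII punctuation plus the Bangla danda, stripped from token edges.
_EDGE_PUNCTUATION = string.punctuation + "\u0964"


def tokenize(comment):
    """Lowercase, split on whitespace, and strip punctuation from token edges."""
    stripped = (tok.strip(_EDGE_PUNCTUATION) for tok in comment.lower().split())
    return [tok for tok in stripped if tok]


def extract_features(comment, likes, lexicon=ABUSIVE_LEXICON):
    """Build a sparse feature dict: unigram counts, bigram counts, likes,
    and the number of tokens found in the abusive-word lexicon."""
    tokens = tokenize(comment)
    features = Counter()
    for tok in tokens:
        features["uni=" + tok] += 1
    for left, right in zip(tokens, tokens[1:]):
        features["bi=" + left + " " + right] += 1
    features["likes"] = likes
    features["abusive_count"] = sum(1 for tok in tokens if tok in lexicon)
    return dict(features)


if __name__ == "__main__":
    # Transliterated example comment with a placeholder abusive token.
    print(extract_features("Tui ekta BADWORD, tui!", likes=3))
```

A dictionary of this shape could then be vectorized and passed to any of the three classifiers named in the description; splitting on whitespace rather than on a word-character pattern is a deliberate choice here, so that Bangla-script words containing vowel signs are not broken apart.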