Author:
Kebriaei, Emad; Homayouni, Ali; Faraji, Roghayeh; Razavi, Armita; Shakery, Azadeh; Faili, Heshaam; Yaghoobzadeh, Yadollah
Source:
Machine Learning; Jul 2024, Vol. 113, Issue 7, p4359-4379, 21p
Abstract:
With the proliferation of social networks and their growing impact on everyday life, one of the pressing problems in these environments is the spread of verbal and written insults and hate speech. Twitter, as a major platform for distributing text-based content, hosts a considerable amount of abusive material posted by its users. Building a model that recognizes offensive phrases first requires a comprehensive collection of offensive sentences. Moreover, despite the abundance of resources for English and other languages, resources and studies on identifying hateful and offensive statements in Persian remain scarce. In this study, we compiled a 38K-tweet dataset of Persian hate and offensive language using keyword-based data selection strategies. A Persian offensive lexicon and nine hatred-target-group lexicons were collected through crowdsourcing for this purpose. The dataset was annotated manually, with each tweet reviewed by at least two annotators. In addition, to analyze the effect of the lexicons used on language model behavior, we applied two assessment criteria (FPED and pAUCED) to measure the dataset's potential bias. We then reconfigured the dataset based on the bias-measurement results, mitigating the effect of word-level bias in tweets on language model performance. The results indicate that bias is significantly reduced, while the F1 score drops by less than one hundredth. [ABSTRACT FROM AUTHOR]
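The bias criteria named in the abstract follow the standard definitions from the unintended-bias literature: FPED (false positive equality difference) sums, over the target-group subsets, the absolute gap between each group's false positive rate and the overall false positive rate, and pAUCED does the analogous thing with pinned AUC. Below is a minimal Python sketch of FPED under that standard definition; the arrays, group masks, and helper names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def false_positive_rate(labels, preds):
    """FPR = FP / (FP + TN) over binary gold labels and binary predictions."""
    negatives = labels == 0
    if negatives.sum() == 0:
        return 0.0
    return float(((preds == 1) & negatives).sum() / negatives.sum())

def fped(labels, preds, group_masks):
    """False Positive Equality Difference: sum over target groups of the
    absolute gap between each group's FPR and the overall FPR."""
    overall_fpr = false_positive_rate(labels, preds)
    return sum(
        abs(overall_fpr - false_positive_rate(labels[mask], preds[mask]))
        for mask in group_masks
    )

# Toy usage with hypothetical data: gold labels (1 = offensive/hateful),
# classifier decisions, and masks marking tweets containing terms from
# two of the target-group lexicons.
labels = np.array([0, 0, 1, 0, 1, 0, 0, 1])
preds = np.array([0, 1, 1, 0, 1, 1, 0, 1])
group_masks = [
    np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool),  # group-A tweets
    np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=bool),  # group-B tweets
]
# A value near 0 means false positive rates are balanced across groups.
print(fped(labels, preds, group_masks))
```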
Database:
Complementary Index