Pradvis vac: A socio-demographic dataset for determining the level of hatred severity in a low-resource Hinglish language
Authors: | Shankar Biradar, Sunil Saumya, Abhinav Kumar, Ashish Singh |
---|---|
Year of publication: | 2022 |
Source: | ACM Transactions on Asian and Low-Resource Language Information Processing. |
ISSN: | 2375-4702 2375-4699 |
DOI: | 10.1145/3573199 |
Description: | In multilingual societies like India, mixing the native language with English has become common in social media conversations. Further, due to the government’s digitization push, more people from rural India are joining social media platforms, resulting in exponential growth of native-language and code-mixed content. This content carries both positive sentiment (also termed Hope Speech) and negative sentiment (also termed Hate Speech). To keep social media clean and hate-free, it is important to remove the negative content using machine learning filters. Since most existing hate content prediction models are trained on high-resource languages such as English, they fail on code-mixed text because of its spelling variance and non-grammatical structure. In addition, the lack of suitable training data could be one reason behind existing models’ poor performance on code-mixed text. To address these issues and promote research in this direction, we developed a manually annotated Hinglish code-mixed corpus of 9254 comments taken from Twitter handles. We also annotated our data with the target audience and severity level. For each label, we provided a more fine-grained classification with three independent classes, yielding a multi-label, multi-class corpus for predicting the severity of hate content in Hinglish code-mixed text. Further, we modeled various supervised classifiers for severity prediction to validate the proposed data. The proposed models employ transformers for feature extraction and various machine learning and RNN (recurrent neural network) models for classification. According to the experimental results, the target label combined with embeddings of the Twitter text, classified with a BiLSTM (a variant of RNN), performed better on severity prediction, attaining an acceptable weighted F1 score. |
Database: | OpenAIRE |
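
The description reports a weighted F1 score for three-class severity prediction but does not spell out the metric. As a rough illustration only, the sketch below computes a support-weighted F1 in plain Python. The function name `weighted_f1` and the class names `low`, `medium`, and `high` are assumptions made for this example; the record does not name the severity classes, and this is not the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Return the support-weighted mean of per-class F1 scores.

    Each class's F1 is weighted by the share of true samples in that class,
    so frequent classes contribute more to the final score.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0

    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1

    return score


# Hypothetical severity labels, used only to exercise the function.
gold = ["low", "low", "medium", "high"]
pred = ["low", "medium", "medium", "high"]
print(round(weighted_f1(gold, pred), 4))
```

Classes that never occur in the gold labels get zero weight, so only classes with true samples need to be iterated over. For real experiments an established library implementation would be the safer choice.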