Building and Annotating a Codeswitched Hate Speech Corpora

Authors: Edward Ombui, Lawrence Muchemi, Peter Waiganjo Wagacha
Publication year: 2021
Subject:
Source: International Journal of Information Technology and Computer Science. 13:33-52
ISSN: 2074-9015, 2074-9007
DOI: 10.5815/ijitcs.2021.03.03
Description: Presidential campaign periods are a major trigger event for hate speech on social media in almost every country. A systematic review of previous studies indicates that publicly available annotated datasets are inadequate and that the annotation schemes used for hate speech identification have hardly any theoretical underpinning. This situation stifles the development of empirically useful data for research, especially in supervised machine learning. This paper describes the methodology used to develop a multidimensional hate speech framework based on the components of the duplex theory of hate [1]: distance, passion, commitment to hate, and hate as a story. Subsequently, an annotation scheme based on the framework was used to annotate a random sample of ~51k tweets drawn from ~400k tweets collected during the August and October 2017 presidential campaign periods in Kenya. This resulted in a gold-standard code-switched dataset that could be used for comparative and empirical studies in supervised machine learning. Classifiers trained on this dataset could be used to monitor hate speech spikes on social media in real time and to inform data-driven decision-making by relevant government security agencies.
Database: OpenAIRE