BERT and fastText Embeddings for Automatic Detection of Toxic Speech

Author:	Ashwin Geet D'Sa, Dominique Fohr, Irina Illina
Contributors:	Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria), Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Université de Lorraine (UL), Centre National de la Recherche Scientifique (CNRS), Grid'5000, D'Sa, Ashwin Geet
Publication year:	2020
Subject:	[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]; word embeddings; deep neural networks; hate speech detection; Natural language processing; Classifier (linguistics); Offensive; Social media; The Internet; Artificial intelligence; Computer science; Word (computer architecture); 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 0209 industrial biotechnology; 020901 industrial engineering & automation; 16. Peace & justice
Source:	OCTA SIIE 2020 - Information Systems and Economic Intelligence; International Multi-Conference on: "Organization of Knowledge and Advanced Technologies" (OCTA), Feb 2020, Tunis, Tunisia
DOI:	10.1109/octa49274.2020.9151853
Description:	With the expansion of Internet usage, which facilitates the dissemination of individuals' thoughts and expressions, there has been an immense increase in the spread of online hate speech. Social media, community forums, and discussion platforms are a few examples of common venues for online discussion where people are free to communicate. However, some people misuse this freedom of speech by arguing aggressively, offending others, and spreading verbal violence. As there is no clear distinction between the terms offensive, abusive, hate, and toxic speech, in this paper we group all of these terms under toxic speech. In many countries, online toxic speech is punishable by law. Thus, it is important to automatically detect and remove toxic speech from online media. In this work, we propose automatic classification of toxic speech using embedding representations of words and deep-learning techniques. We perform binary and multi-class classification on a Twitter corpus and study two approaches: (a) extracting word embeddings and feeding them to a DNN classifier; (b) fine-tuning the pre-trained BERT model. We observed that BERT fine-tuning performed much better than the embedding-based approach. The proposed methodology can be applied to any other type of social media comments.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::8a633b697189bff809cf2a3d44bc2f14 https://doi.org/10.1109/octa49274.2020.9151853