Unsupervised Domain Adaptation in Cross-corpora Abusive Language Detection

Authors:	Irina Illina, Dominique Fohr, Tulika Bose
Contributors:	Bose, Tulika; ISITE - Isite LUE - LUE2015 - ANR-15-IDEX-0004 - IDEX - VALID; Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH); Inria Nancy - Grand Est; Institut National de Recherche en Informatique et en Automatique (Inria); Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Centre National de la Recherche Scientifique (CNRS); Université de Lorraine (UL); This work was supported partly by the French PIA project 'Lorraine Université d’Excellence', reference ANR-15-IDEX-04-LUE.; IMPACT-OLKi; GRID5000; ANR-15-IDEX-0004, LUE, Isite LUE (2015)
Language:	English
Year of publication:	2021
Subject:	[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO] Computer Science [cs]; Domain adaptation; Language identification; Computer science; Task (project management); Annotation; Artificial intelligence; Language model; Adaptation (computer science); Natural language processing; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 05 social sciences; 0509 other social sciences; 050904 information & library sciences
Source:	SocialNLP@NAACL; SocialNLP 2021 - The 9th International Workshop on Natural Language Processing for Social Media, Jun 2021, Virtual, France
Description:	State-of-the-art abusive language detection models report strong in-corpus performance but underperform when evaluated on abusive comments that differ from the training scenario. Because human annotation takes substantial time and effort, models that can adapt to newly collected comments can prove useful. In this paper, we investigate the effectiveness of several Unsupervised Domain Adaptation (UDA) approaches for the task of cross-corpora abusive language detection. For comparison, we adapt a variant of the BERT model, trained on large-scale abusive comments, using Masked Language Model (MLM) fine-tuning. Our evaluation shows that the UDA approaches result in sub-optimal performance, while the MLM fine-tuning does better in the cross-corpora setting. Detailed analysis reveals the limitations of the UDA approaches and emphasizes the need to build efficient adaptation methods for this task.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::1b41d5198902051d7781f7713447ce8a https://hal.inria.fr/hal-03204605