Using Confidential Data for Domain Adaptation of Neural Machine Translation

Authors: Fatih Turkmen, Arianna Bisazza, Sohyung Kim
Contributors: Computational Linguistics (CL), Information Systems
Language: English
Year of publication: 2021
Subject:
Source: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 46-52
Description: We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.
Database: OpenAIRE
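
Note: The extraction step summarized in the description (compare the last tagged revision with the first untagged one and keep the sentences that were removed or rewritten) can be illustrated with a minimal sketch. This is not the authors' implementation, which is hosted at the repository named in the record. The function names `split_sentences` and `changed_sentences`, the naive regex sentence splitter, and the sample texts are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def changed_sentences(before, after):
    """Return sentences of `before` that were removed or rewritten in `after`."""
    old, new = split_sentences(before), split_sentences(after)
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'delete' = sentence dropped, 'replace' = sentence rewritten.
        if tag in ("delete", "replace"):
            changed.extend(old[i1:i2])
    return changed


# Hypothetical before/after snapshots of a revision pair.
before = (
    "The city has 50,000 inhabitants. "
    "It is clearly the most beautiful city in the region. "
    "It lies on a river."
)
after = "The city has 50,000 inhabitants. It lies on a river."

print(changed_sentences(before, after))
```

Comparing at the sentence level rather than the character level keeps each candidate intact, so a rewritten sentence is returned in its original (pre-edit) form.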