Using Confidential Data for Domain Adaptation of Neural Machine Translation

Authors:	Fatih Turkmen, Arianna Bisazza, Sohyung Kim
Contributors:	Computational Linguistics (CL), Information Systems
Language:	English
Year of publication:	2021
Subject:	Domain adaptation; Phrase; Machine translation; Computer science; business.industry; media_common.quotation_subject; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; computer.software_genre; Fragment (logic); Simple (abstract algebra); ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Confidentiality; Quality (business); Artificial intelligence; business; Adaptation (computer science); computer; Natural language processing; media_common
Source:	Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 46-52
Description:	We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::ed6ca881b770d4b34d2d8cde754710fb https://research.rug.nl/en/publications/e7bd1a91-30a7-4129-b863-ce6138c6b72a