Using Confidential Data for Domain Adaptation of Neural Machine Translation
Author: | Fatih Turkmen, Arianna Bisazza, Sohyung Kim |
---|---|
Contributors: | Computational Linguistics (CL), Information Systems |
Language: | English |
Publication year: | 2021 |
Subject: | Domain adaptation; Phrase; Machine translation; Computer science; Fragment (logic); Simple (abstract algebra); Confidentiality; Quality (business); Artificial intelligence; Adaptation (computer science); Natural language processing; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING |
Source: | Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 46-52 |
Description: | We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia's neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias. |
Database: | OpenAIRE |
External link: |