Data Selection for Unsupervised Translation of German–Upper Sorbian

Authors: Lukas Edman, Antonio Toral, Gertjan van Noord
Language: English
Source: Proceedings of the Fifth Conference on Machine Translation (WMT), pp. 1099-1103
University of Groningen
Description: This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German–Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.
Database: OpenAIRE