Data Selection as an Alternative to Quality Estimation in Self-Learning for Low Resource Neural Machine Translation

Authors: Ibrahim Said Ahmad, Rabiu Ibrahim Abdullahi, Idris Abdulmumin, Bashir Shehu Galadanci
Year of publication: 2021
Source: Computational Science and Its Applications – ICCSA 2021 ISBN: 9783030870126
ICCSA (9)
DOI: 10.1007/978-3-030-87013-3_24
Description: For many languages, the lack of sufficient parallel data to train translation models has led to the use of monolingual data, on the source side through self-learning and on the target side through back-translation. Most works that implement self-learning use a quality estimation system to ensure that the resulting additional training data is of sufficient quality to improve the model. However, a quality estimation system may not be available for many low-resource languages, which restricts the approach to very few of them. This work proposes data selection as an alternative to quality estimation. The approach ensures that the models learn only from data that is closer to the domain of the test set, improving the performance of the translation models. The approach is applicable to many, if not all, languages; on the low-resource IWSLT’14 English-German dataset, we obtained similar and, in some implementations, even better results (+0.53 BLEU) than the self-training approach implemented with a quality estimation system. We also show that the proposed approach can improve back-translation, gaining +1.79 and +0.23 over standard back-translation and over back-translation enhanced by self-learning with quality estimation, respectively.
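The idea of keeping only monolingual sentences close to the test domain can be illustrated with a minimal sketch. The description does not say which data selection technique the authors use, so the bag-of-words cosine similarity below is a stand-in chosen for illustration only, and the names `tokenize`, `cosine`, and `select_in_domain` are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Lowercased whitespace split; a real pipeline would use a proper tokenizer.
    return sentence.lower().split()


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors (Counters).
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_domain(candidates, in_domain, k):
    # Hypothetical helper (not the authors' method): rank candidate
    # monolingual sentences by similarity to an aggregate term profile
    # of the in-domain sentences and keep the k most similar ones.
    domain_profile = Counter()
    for sentence in in_domain:
        domain_profile.update(tokenize(sentence))
    scored = [(cosine(Counter(tokenize(s)), domain_profile), s) for s in candidates]
    # Python's sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


in_domain = ["thank you for the talk", "the talk was about science"]
monolingual = [
    "stock prices fell sharply today",
    "the talk about science was great",
    "thank you",
]
print(select_in_domain(monolingual, in_domain, k=2))
```

In a self-learning setup, only the selected sentences would then be translated by the current model and added as synthetic training pairs, taking the place of a quality estimation filter.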
Database: OpenAIRE