Data Selection as an Alternative to Quality Estimation in Self-Learning for Low Resource Neural Machine Translation
Author: | Ibrahim Said Ahmad, Rabiu Ibrahim Abdullahi, Idris Abdulmumin, Bashir Shehu Galadanci |
---|---|
Year of publication: | 2021 |
Subject: |
Estimation; Machine translation; Computer science; Translation (geometry); Machine learning; Domain (software engineering); Test set; Quality (business); Artificial intelligence; Implementation; Data selection |
Source: | Computational Science and Its Applications – ICCSA 2021 ISBN: 9783030870126 ICCSA (9) |
DOI: | 10.1007/978-3-030-87013-3_24 |
Description: | For many languages, the lack of sufficient parallel data to train translation models has led to the use of monolingual data, source and target, through self-learning and back-translation respectively. Most works that implement the self-learning approach use a quality estimation system to ensure that the resulting additional training data is of sufficient quality to improve the model. However, a quality estimation system may not be available for many low-resource languages, restricting the implementation of such an approach to very few of them. This work proposes data selection as an alternative to quality estimation. The approach ensures that the models learn only from data that is closer to the domain of the test set, improving the performance of the translation models. While this approach is applicable to many, if not all, languages, we obtained similar and, in some implementations, even better results (+0.53 BLEU) than the self-training approach implemented with a quality estimation system on the low-resource IWSLT’14 English-German dataset. We also showed that the proposed approach can be used to improve the performance of back-translation, gaining +1.79 and +0.23 over standard back-translation and self-learning with quality estimation enhanced back-translation respectively. |
Database: | OpenAIRE |
External link: |
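
The abstract describes filtering self-learning output by closeness to the domain of the test set, but does not say which selection technique is used. The following is only a minimal sketch of that general idea, assuming a simple bag-of-words cosine similarity as the closeness measure. The function names (`bag_of_words`, `cosine`, `select_synthetic`), the `keep_ratio` parameter, and the toy sentences are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bag_of_words(sentences):
    """Lower-cased whitespace-token counts over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_synthetic(pairs, in_domain_sources, keep_ratio=0.5):
    """Keep the synthetic (source, translation) pairs whose source side is
    most similar to an in-domain sample (assumed stand-in for the test domain)."""
    profile = bag_of_words(in_domain_sources)
    ranked = sorted(
        pairs,
        key=lambda pair: cosine(bag_of_words([pair[0]]), profile),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


# Toy data: an in-domain sample and synthetic pairs produced by a forward model.
in_domain = ["thank you for coming to my talk", "i want to talk about ideas"]
synthetic = [
    ("i want to thank you for the talk", "ich möchte ihnen für den vortrag danken"),
    ("stock markets fell sharply on monday", "die aktienmärkte fielen am montag stark"),
]
selected = select_synthetic(synthetic, in_domain, keep_ratio=0.5)
```

In this sketch the selected pairs would be added to the parallel training data in place of pairs filtered by a quality estimation system. A real setup would likely use a stronger similarity measure and proper tokenization.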