Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning

Author: Wei, Bin, Zhen, Jiawei, Li, Zongyao, Wu, Zhanglin, Wei, Daimeng, Guo, Jiaxin, Rao, Zhiqiang, Li, Shaojun, Luo, Yuanchang, Shang, Hengchao, Yang, Jinlong, Xie, Yuhao, Yang, Hao
Year of publication: 2024
Subject:
Document type: Working Paper
Description: This paper introduces the submission by Huawei Translation Center (HW-TSC) to the WMT24 Indian Languages Machine Translation (MT) Shared Task. To develop a reliable machine translation system for low-resource Indian languages, we employed two distinct knowledge transfer strategies, taking into account the characteristics of the language scripts and the support available from existing open-source models for Indian languages. For Assamese (as) and Manipuri (mn), we fine-tuned the existing IndicTrans2 open-source model to enable bidirectional translation between English and these languages. For Khasi (kh) and Mizo (mz), we trained a multilingual model as a baseline using bilingual data from these four language pairs, along with about 8kw of additional English-Bengali bilingual data, all of which share certain linguistic features. This was followed by fine-tuning to achieve bidirectional translation between English and Khasi, as well as English and Mizo. Our transfer learning experiments produced impressive results: 23.5 BLEU for en-as, 31.8 BLEU for en-mn, 36.2 BLEU for as-en, and 47.9 BLEU for mn-en on their respective test sets. Similarly, the multilingual-model transfer learning experiments achieved 19.7 BLEU for en-kh, 32.8 BLEU for en-mz, 16.1 BLEU for kh-en, and 33.9 BLEU for mz-en on their respective test sets. These results not only highlight the effectiveness of transfer learning techniques for low-resource languages but also contribute to advancing machine translation capabilities for low-resource Indian languages.
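All of the results above are reported in BLEU, which scores a translation by its n-gram overlap with a reference, scaled by a brevity penalty. As a rough, self-contained illustration of the metric (a simplified single-sentence variant with add-one smoothing, not the shared task's official scorer such as sacrebleu):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU (0-100) with add-one smoothing
    and a brevity penalty; assumes whitespace tokenization."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_counts = ngram_counts(cand, n)
        r_counts = ngram_counts(ref, n)
        # clipped matches: each candidate n-gram counts at most as
        # often as it appears in the reference
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(1, sum(c_counts.values()))
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 100; partial overlap scores in between, which is the scale on which figures like 47.9 BLEU for mn-en should be read.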
Comment: 6 pages, wmt24. arXiv admin note: substantial text overlap with arXiv:2409.14800
Database: arXiv