Multilingual Multi-Domain NMT for Indian Languages

Authors: Salil Aggarwal, Dipti Misra Sharma, Sourav Kumar
Year of publication: 2021
Subject:
Source: RANLP
Description: India is known as the land of many tongues and dialects. Neural machine translation (NMT) is the current state-of-the-art approach to machine translation (MT), but it performs well only with large datasets, which Indian languages usually lack, making the approach infeasible. In this paper, we address the problem of data scarcity by efficiently training multilingual and multilingual multi-domain NMT systems involving languages of the Indian subcontinent. We propose a technique for using joint domain and language tags in a multilingual setup. We draw three major conclusions from our experiments: (i) training a multilingual system that exploits lexical similarity based on language family yields an overall average improvement of 3.25 BLEU points over bilingual baselines; (ii) incorporating domain information into the language tokens gives the multilingual multi-domain system a significant average improvement of 6 BLEU points over the baselines; (iii) multistage fine-tuning yields a further improvement of 1-1.5 BLEU points for the language pair of interest.
Database: OpenAIRE
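
Note: The description mentions joint domain and language tags but does not give their format. The following is a minimal sketch of one common way such tagging could be done, by prepending a single control token to each source sentence. The token shape (`<2hi_health>`), the function names `make_tag` and `tag_source`, and the sample sentences are all hypothetical and are not taken from the paper.

```python
from typing import Optional


def make_tag(target_lang: str, domain: Optional[str] = None) -> str:
    """Build a control token for the source side.

    Assumed format (hypothetical): '<2hi>' for a language-only tag,
    '<2hi_health>' for a joint language and domain tag.
    """
    if domain:
        return f"<2{target_lang}_{domain}>"
    return f"<2{target_lang}>"


def tag_source(sentence: str, target_lang: str, domain: Optional[str] = None) -> str:
    """Prepend the control token to a source sentence."""
    return f"{make_tag(target_lang, domain)} {sentence.strip()}"


# Illustrative (made-up) source sentences with target language and domain.
corpus = [
    ("The patient has a fever.", "hi", "health"),
    ("The train leaves at noon.", "pa", "tourism"),
    ("Good morning.", "gu", None),  # no domain: falls back to a language-only tag
]

tagged = [tag_source(text, lang, dom) for text, lang, dom in corpus]
for line in tagged:
    print(line)
```

With tagging of this kind, one shared model can be trained on the concatenation of all language pairs and domains, and the token tells it which target language and domain to produce at inference time.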