Unsupervised dialectal neural machine translation

Authors: Ahmad Bisher Tarakji, Ruba Waleed Jaikat, Wael Farhan, Bashar Talafha, Mahmoud Al-Ayyoub, Anas Toma, Analle Abuammar
Year of publication: 2020
Source: Information Processing & Management. 57:102181
ISSN: 0306-4573
DOI: 10.1016/j.ipm.2019.102181
Description: In this paper, we present the first work on unsupervised dialectal Neural Machine Translation (NMT), where the source dialect is not represented in the parallel training corpus. Two systems are proposed for this problem. The first is the Dialectal to Standard Language Translation (D2SLT) system, which is based on the standard attentional sequence-to-sequence model and introduces two novel ideas that leverage similarities among dialects: using common words as anchor points when learning word embeddings, and a decoder scoring mechanism based on cosine similarity and language models. The second system is based on the celebrated Google NMT (GNMT) system. We first evaluate these systems in a supervised setting (where training and testing are done using our parallel corpus of Jordanian dialect and Modern Standard Arabic (MSA)) before moving to the unsupervised setting (where we train each system once on a Saudi-MSA parallel corpus and once on an Egyptian-MSA parallel corpus, and test them on the Jordanian-MSA parallel corpus). The highest BLEU score obtained in the unsupervised setting is 32.14 (by D2SLT trained on Saudi-MSA data), which is remarkably high compared with the highest BLEU score obtained in the supervised setting, 48.25.
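The decoder scoring idea described above can be illustrated with a minimal sketch. This is not the authors' D2SLT implementation; it is a hypothetical rescoring step assuming word embeddings in a shared dialect/MSA space (anchored on common words) and a stand-in unigram language model in place of a real one. All function and variable names here (`rescore`, `lm_prob`, `alpha`, `beta`) are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rescore(candidates, src_vec, embeddings, lm_prob, alpha=0.5, beta=0.5):
    """Pick the best candidate target word by combining three signals:
    the decoder's own score, cosine similarity to the source-word
    embedding (hypothetical shared dialect/MSA space), and a
    language-model probability (here a toy unigram stand-in).

    candidates: list of (word, decoder_score) pairs
    src_vec:    embedding of the source (dialect) word
    embeddings: dict mapping candidate word -> embedding vector
    lm_prob:    dict mapping candidate word -> LM probability
    """
    scored = []
    for word, dec_score in candidates:
        sim = cosine(src_vec, embeddings[word])
        score = dec_score + alpha * sim + beta * np.log(lm_prob[word])
        scored.append((word, score))
    # Return the highest-scoring candidate word.
    return max(scored, key=lambda t: t[1])[0]

# Toy usage: two candidates with equal decoder and LM scores; the one
# whose embedding is closer to the source word wins on cosine similarity.
src = np.array([1.0, 0.0])
emb = {"book": np.array([0.9, 0.1]), "write": np.array([0.1, 0.9])}
lm = {"book": 0.5, "write": 0.5}
best = rescore([("book", 0.0), ("write", 0.0)], src, emb, lm)
```

In this toy run the tie between the decoder scores is broken by embedding similarity, so `best` is `"book"`; the `alpha`/`beta` weights trading off similarity against LM fluency are an assumption of this sketch.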
Database: OpenAIRE