Simulated Multiple Reference Training Improves Low-Resource Machine Translation
Author: | Brian Thompson, Huda Khayrallah, Matt Post, Philipp Koehn |
---|---|
Language: | English |
Year of publication: | 2020 |
Subject: | FOS: Computer and information sciences; Computer Science - Computation and Language (cs.CL); Machine translation; Low-resource MT; Paraphrase; Natural language processing; BLEU |
Source: | EMNLP (1) |
Description: | Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and training the MT model to predict the paraphraser's distribution over possible tokens. We demonstrate the effectiveness of SMRT in low-resource settings when translating to English, with improvements of 1.2 to 7.0 BLEU. We also find SMRT is complementary to back-translation. EMNLP 2020 camera ready. (A minimal sketch of the training step described here follows the record.) |
Database: | OpenAIRE |
External link: |
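
The sketch below illustrates the idea summarized in the description: sample a paraphrase of the reference from a paraphraser and train the MT model against the paraphraser's per-token distribution rather than a one-hot reference. It assumes generic PyTorch seq2seq interfaces; `mt_model`, `paraphraser`, and their `sample`/forward signatures are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of an SMRT-style training step (assumed interfaces, not the paper's code).
import torch
import torch.nn.functional as F

def smrt_step(mt_model, paraphraser, src_tokens, ref_tokens, optimizer):
    """One training step: sample a paraphrase of the reference and train
    the MT model toward the paraphraser's distribution over tokens."""
    with torch.no_grad():
        # Sample an alternative reference from the paraphraser and keep its
        # logits over the vocabulary at each output position (hypothetical API).
        para_tokens, para_logits = paraphraser.sample(ref_tokens)
        target_dist = F.softmax(para_logits, dim=-1)          # (length, vocab)

    # Score the sampled paraphrase with the MT model under teacher forcing
    # (hypothetical forward signature returning per-position logits).
    mt_logits = mt_model(src_tokens, para_tokens)              # (length, vocab)

    # Soft cross-entropy against the full paraphraser distribution instead of
    # a one-hot reference (equivalent to KL divergence up to a constant).
    loss = -(target_dist * F.log_softmax(mt_logits, dim=-1)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this formulation the usual single-reference cross-entropy is recovered when the paraphraser's distribution collapses to one-hot vectors on the original reference; the soft targets are what simulate multiple references from a single parallel sentence pair.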