An End-to-End Method for Data Filtering on Tibetan-Chinese Parallel Corpus via Negative Sampling

Authors: Cizhen Jiacuo, Sangjie Duanzhu, Sanzhi Jia, Cairang Jia, Rou Te
Year of publication: 2019
Subject:
Source: Lecture Notes in Computer Science ISBN: 9783030323806
CCL
DOI: 10.1007/978-3-030-32381-3_34
Description: In machine translation, a parallel corpus is the most important prerequisite for learning complex mappings between the two languages of a pair. In practice, however, corpus size is not the only factor to consider when improving translation models, because the quality of the parallel data itself also has a substantial impact on model capacity. In recent years, neural machine translation (NMT) systems have become the de facto choice in MT research, but they are more vulnerable to noise in the training data than traditional statistical machine translation models. Data filtering is therefore an indispensable step in the NMT pre-processing pipeline. Instead of using discrete feature representations of basic language units to build a ranking function over sentence pairs, in this work we propose a fully end-to-end parallel sentence classifier that estimates the probability that a given sentence pair consists of mutual translations. Our model was tested in three scenarios: classification, sentence extraction, and NMT data filtering. All experiments showed promising results; in particular, in the Tibetan-Chinese NMT experiments a gain of 3.7 BLEU was observed after applying our data filtering method, indicating the effectiveness of our model.
Database: OpenAIRE
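
Note: The abstract does not say how negative examples are drawn or how the classifier's probabilities are turned into a filtering decision. The following is only a minimal sketch under stated assumptions: negatives are produced by pairing each source sentence with a randomly chosen target from a different pair, and a pair is kept when its score reaches a fixed threshold. The names `build_training_pairs`, `filter_corpus`, `score_fn`, and the threshold value are hypothetical and do not come from the paper.

```python
import random


def build_training_pairs(pairs, negatives_per_positive=1, seed=0):
    """Return (source, target, label) triples for a pair classifier.

    Aligned pairs get label 1. Negatives (label 0) are made by pairing a
    source with a target taken from a different pair. This random-mismatch
    scheme is an assumption, not the paper's documented procedure.
    """
    rng = random.Random(seed)
    examples = []
    for i, (src, tgt) in enumerate(pairs):
        examples.append((src, tgt, 1))
        # Targets from other pairs that differ from the true target.
        candidates = [t for j, (_, t) in enumerate(pairs) if j != i and t != tgt]
        k = min(negatives_per_positive, len(candidates))
        for wrong_tgt in rng.sample(candidates, k):
            examples.append((src, wrong_tgt, 0))
    return examples


def filter_corpus(pairs, score_fn, threshold=0.5):
    """Keep pairs whose score is at least `threshold`.

    `score_fn(src, tgt)` is a hypothetical stand-in for a trained
    classifier that returns a probability in [0, 1].
    """
    return [(s, t) for s, t in pairs if score_fn(s, t) >= threshold]
```

In this sketch the labeled triples would be used to train the classifier, and `filter_corpus` would then be applied to a noisy corpus before NMT training; the threshold would need to be tuned on held-out data.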