Studying the Role of Data Quality on Statistical and Neural Machine Translation

Authors: Geetam Singh Tomar, Shyam Sunder Agrawal, Karunesh Arora
Year of publication: 2021
Subject:
Source: 2021 10th IEEE International Conference on Communication Systems and Network Technologies (CSNT).
Description: Statistical and neural machine translation techniques depend on the parallel data used to train the models. The general belief is that more training data results in better models. We studied the available corpora and the noise present in them, and found the available data to be highly noisy. The paper describes the types of noise present and how they are identified. Different noise filters were developed, and normalization processes were applied to the corpora. Statistical and neural machine translation models were trained to study the impact of cleaning noisy data. We performed experiments with the noisy data and with the cleaned data obtained by discarding noisy data from the training corpus. The standard WMT-14 test set was used for evaluation, and translation quality was measured with BLEU scores. Even after a significant volume of noisy data was discarded, the models trained without noisy data performed better than those trained on the corpus containing noise. This shows that data quality has a significant impact and that merely having huge piles of uncleaned data is not a good choice. The test case presented here is for the English-Hindi language pair. It also suggests that, for low-resource language pairs, paying attention to data quality pays off in the form of better translation performance. As the types of noise discussed in the paper are general in nature, the findings should also hold for other Indian language pairs, owing to the inherent similarity among Indian languages.
Database: OpenAIRE
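
The abstract mentions noise filters and normalization applied to an English-Hindi parallel corpus but does not specify them. As a rough illustration only, the sketch below shows what such a filter might look like. The heuristics (empty sides, untranslated copies, token-length ratio, share of Devanagari letters on the Hindi side), the thresholds, and all function names are assumptions made for this example, not the filters used in the paper.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def devanagari_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Devanagari block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return devanagari / len(letters)


def is_noisy(src: str, tgt: str,
             max_len_ratio: float = 3.0,
             min_devanagari: float = 0.5) -> bool:
    """Flag a sentence pair as noisy using hypothetical heuristics."""
    if not src or not tgt:
        return True  # one side is empty
    if src == tgt:
        return True  # target is an untranslated copy of the source
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
        return True  # token lengths are too different to be a translation
    if devanagari_ratio(tgt) < min_devanagari:
        return True  # Hindi side is mostly not in Devanagari script
    return False


def filter_corpus(pairs):
    """Normalize each (English, Hindi) pair and keep only the clean ones."""
    clean = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not is_noisy(src, tgt):
            clean.append((src, tgt))
    return clean


pairs = [
    ("This is a book .", "यह एक किताब है ।"),
    ("Hello", ""),
    ("Click here", "Click here"),
    ("Yes", "यह एक बहुत लंबा वाक्य है जो मेल नहीं खाता"),
    ("Good morning", "Good morning everyone"),
]
clean_pairs = filter_corpus(pairs)
```

In this toy input only the first pair survives; the others are dropped for an empty target, an untranslated copy, a length mismatch, and a non-Devanagari target respectively. A real pipeline would tune the thresholds on held-out data and compare BLEU for models trained on the filtered and unfiltered corpora.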