Improving Phrase-Based Statistical Machine Translation with Preprocessing Techniques

Authors: Sanath Jayasena, S. Yashothara, R. T. Uthayasanker
Year of publication: 2018
Subject:
Source: IALP
DOI: 10.1109/ialp.2018.8629203
Description: This work presents an improvement to phrase-based statistical machine translation models that incorporates linguistic knowledge, namely part-of-speech information, together with preprocessing techniques. Any Statistical Machine Translation (SMT) system needs large parallel corpora to perform well, so the lack of such corpora limits the success achievable in machine translation to and from resource-poor languages. In this study, we choose Sinhala to Tamil translation, which is important because both are official languages of Sri Lanka and both are resource-poor. Even though the findings presented here are for Sinhala to Tamil translation, the concept of preprocessing is language neutral and can be extended to any other language pair with different parameters. To overcome the translation challenges in traditional SMT, preprocessing techniques are used. The preprocessing described in this research covers generating phrasal units, Parts of Speech (POS) integration, and segmentation. Finally, the system is evaluated automatically using BLEU as the evaluation metric. We observed that all preprocessing techniques outperform the baseline system. The best performance is reported with PMI-based chunking for Sinhala to Tamil translation. We could improve performance by 12% BLEU (3.56) using a small Sinhala to Tamil corpus with the help of the proposed PMI-based preprocessing. Notably, this increase is significantly higher than the increase shown by prior approaches for the same language pair.
Database: OpenAIRE
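
Note: The abstract names PMI-based chunking as the best-performing preprocessing step but does not describe the procedure. The sketch below is only a general illustration of pointwise mutual information (PMI) over adjacent word pairs, followed by a greedy merge of pairs that score above a threshold. The function names `pmi_scores` and `chunk`, the toy English corpus, the underscore joiner, and the threshold value are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' method.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return log2 PMI for every adjacent word pair in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the pair
        p_x = unigrams[x] / n_uni    # relative frequency of each word
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def chunk(tokens, scores, threshold):
    """Greedily join adjacent tokens whose PMI is above the threshold."""
    if not tokens:
        return []
    out = [tokens[0]]
    prev = tokens[0]
    for tok in tokens[1:]:
        # Unseen pairs get -inf, so they are never merged.
        if scores.get((prev, tok), float("-inf")) > threshold:
            out[-1] = out[-1] + "_" + tok  # hypothetical joiner
        else:
            out.append(tok)
        prev = tok
    return out


# Toy corpus (an assumption; the paper uses a Sinhala-Tamil parallel corpus).
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "far"],
    ["it", "is", "big"],
]
scores = pmi_scores(corpus)
print(chunk(["new", "york", "is", "big"], scores, threshold=2.5))
```

In a preprocessing pipeline of this kind, the merged units would be written back to the training corpus before the phrase-based system is trained, so that strongly associated word pairs are treated as single tokens. The threshold would need tuning on held-out data.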