Improving Phrase-Based Statistical Machine Translation with Preprocessing Techniques
Author: | Sanath Jayasena, S. Yashothara, R. T. Uthayasanker |
---|---|
Publication year: | 2018 |
Subject: |
060201 languages & linguistics; Phrase; Machine translation; Computer science; business.industry; 06 humanities and the arts; 02 engineering and technology; Semantics; Part of speech; computer.software_genre; language.human_language; Tokenization (data security); Tamil; 0602 languages and literature; Chunking (psychology); 0202 electrical engineering, electronic engineering, information engineering; language; Preprocessor; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing |
Source: | IALP |
DOI: | 10.1109/ialp.2018.8629203 |
Description: | This work presents an improvement to phrase-based statistical machine translation models that incorporates linguistic knowledge, namely part-of-speech information, together with preprocessing techniques. Any Statistical Machine Translation (SMT) system needs large parallel corpora to perform accurately, so the lack of such corpora limits the success achievable in machine translation to and from resource-poor languages. In this study, we choose Sinhala to Tamil translation, which is important because both are official languages of Sri Lanka and both are resource-poor. Even though the findings presented here are for Sinhala to Tamil translation, the concept of preprocessing is language neutral and can be extended to any other language pair with different parameters. Preprocessing techniques are used to overcome the translation challenges of traditional SMT. The preprocessing described in this research covers generating phrasal units, Parts of Speech (POS) integration, and segmentation. Finally, the system is evaluated automatically using BLEU as the evaluation metric. We observed that all preprocessing techniques outperform the baseline system. The best performance is reported with PMI based chunking for Sinhala to Tamil translation. We could improve performance by 12% BLEU (3.56) using a small Sinhala to Tamil corpus with the help of the proposed PMI based preprocessing. Notably, this increase is significantly higher than the increase shown by prior approaches for the same language pair. |
Database: | OpenAIRE |
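
The description names PMI based chunking as the best-performing preprocessing step but does not spell out the procedure. The following is a minimal sketch of the general idea only, assuming the standard definition of pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), computed over adjacent word pairs. The function names `bigram_pmi` and `chunk`, the greedy joining rule, the underscore joiner, and the threshold are all hypothetical choices for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(w1, w2): PMI in bits} for adjacent word pairs.

    `sentences` is a list of token lists. Probabilities are plain
    relative frequencies (an assumption; no smoothing is applied).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def chunk(tokens, scores, threshold):
    """Greedily join adjacent tokens whose PMI exceeds `threshold`.

    Hypothetical merging rule: joined tokens are glued with "_" so a
    phrase-based SMT system would see them as one unit.
    """
    if not tokens:
        return []
    chunks = [tokens[0]]
    prev = tokens[0]
    for tok in tokens[1:]:
        if scores.get((prev, tok), float("-inf")) > threshold:
            chunks[-1] += "_" + tok
        else:
            chunks.append(tok)
        prev = tok
    return chunks
```

In a pipeline of this kind, `bigram_pmi` would be run once over each side of the parallel corpus and `chunk` applied to every sentence before alignment and phrase extraction. The threshold would need tuning per language, which is in line with the description's remark that the approach carries over to other language pairs with different parameters.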