Improving Phrase-Based Statistical Machine Translation with Preprocessing Techniques

Authors:	Sanath Jayasena, S. Yashothara, R. T. Uthayasanker
Year of publication:	2018
Subject:	060201 languages & linguistics; Phrase; Machine translation; Computer science; business.industry; 06 humanities and the arts; 02 engineering and technology; Semantics; Part of speech; computer.software_genre; language.human_language; Tokenization (data security); Tamil; 0602 languages and literature; Chunking (psychology); 0202 electrical engineering, electronic engineering, information engineering; language; Preprocessor; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing
Source:	IALP
DOI:	10.1109/ialp.2018.8629203
Description:	This work presents an improvement to phrase-based statistical machine translation models that incorporates linguistic knowledge, namely part-of-speech information, together with preprocessing techniques. Any Statistical Machine Translation (SMT) system needs large parallel corpora to perform accurately, so the lack of such corpora limits the success achievable in machine translation to and from resource-poor languages. In this study, we choose Sinhala to Tamil translation, which is important because both are official languages of Sri Lanka and both are resource-poor. Even though the findings presented here are for Sinhala to Tamil translation, the concept of preprocessing is language neutral and can be extended to any other language pair with different parameters. Preprocessing techniques are used to overcome the translation challenges of traditional SMT. The preprocessing described in this research covers generating phrasal units, Parts of Speech (POS) integration, and segmentation. Finally, the system is evaluated automatically using BLEU as the evaluation metric. We observed that all preprocessing techniques outperform the baseline system. The best performance is reported with PMI-based chunking for Sinhala to Tamil translation. We improved performance by 12% BLEU (3.56) using a small Sinhala to Tamil corpus with the help of the proposed PMI-based preprocessing. Notably, this increase is significantly higher than the increase shown by prior approaches for the same language pair.
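Illustration:	The description names PMI-based chunking but does not say how it was implemented. The following is only a minimal sketch of the general idea, assuming PMI (pointwise mutual information) is computed over adjacent word pairs and that neighbouring tokens are merged into one phrasal unit when their PMI reaches a threshold. The function names `pmi_scores` and `chunk`, the threshold value, and the `_` joiner are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return log2 PMI for every adjacent word pair in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # probability of the pair
        p_x = unigrams[w1] / n_uni   # probability of the first word
        p_y = unigrams[w2] / n_uni   # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def chunk(tokens, scores, threshold, joiner="_"):
    """Greedily merge adjacent tokens whose PMI is at or above threshold.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    if not tokens:
        return []
    chunks = [tokens[0]]
    prev = tokens[0]
    for tok in tokens[1:]:
        if scores.get((prev, tok), float("-inf")) >= threshold:
            chunks[-1] = chunks[-1] + joiner + tok  # extend the current chunk
        else:
            chunks.append(tok)                      # start a new chunk
        prev = tok
    return chunks


# Toy English corpus, used only to show the mechanics.
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "far"],
    ["it", "is", "big"],
]
scores = pmi_scores(corpus)
print(chunk(["new", "york", "is", "big"], scores, threshold=2.5))
```

In a pipeline of this kind, the chunked text would presumably replace the raw tokenized text on one or both sides of the parallel corpus before the phrase-based system is trained, so that strongly associated word sequences are treated as single units.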
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::287e81fac811ddcdd94f70a9353fd56b https://doi.org/10.1109/ialp.2018.8629203