Language localisation of Tamil using Statistical Machine Translation

Author: Y. Achchuthan, K. Sarveswaran
Year of publication: 2015
Subject:
Source: 2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer).
DOI: 10.1109/icter.2015.7377677
Description: Language localisation, in which the strings of an interface and its documentation are translated into a new language, is a rigorous and time-consuming task. Machine translation systems, specifically Statistical Machine Translation (SMT) systems, are meanwhile used successfully for many language pairs. A few SMT systems have been developed for the generic domain; however, no systems are yet available to aid localisation. This research proposes a new methodology for carrying out language localisation using SMT and identifies suitable parameters on which an SMT-aided localisation system could be built. A pilot system has been developed and is outlined in this paper, along with a RESTful API that facilitates localisation in remote tools. Several open source software packages have already been translated into Tamil. The translated English-Tamil pairs were collected from various language resource files, then cleaned, tokenised and used to train the system. A second, similar system was prepared with data from the generic domain in addition to the collected technical data. The systems were trained with 2-gram, 3-gram and 4-gram language models created using two different language modelling tools, KenLM and IRSTLM, and the results were evaluated using the BLEU algorithm. Appropriate parameters for setting up an SMT system for localisation were identified from the evaluation. The results show that training a system with a 3-gram language model is sufficient, and that the modified BLEU algorithm gives a better understanding of the results than its original implementation. Further, KenLM was found to perform better than IRSTLM in terms of both accuracy of results and speed of execution.
Database: OpenAIRE
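
The data preparation step mentioned in the description (collecting English-Tamil pairs from language resource files, then cleaning and tokenising them) could look roughly like the sketch below. It is not the authors' pipeline. The record does not name the resource file format, so gettext PO-style entries are assumed here, and the sample strings, the cleaning rules and the function names (extract_pairs, clean, tokenise, build_corpus) are all hypothetical.

```python
import re

# Hypothetical sample in gettext PO style (an assumption: the record does
# not say which resource formats were actually used).
SAMPLE_PO = '''
msgid "_Open File"
msgstr "கோப்பைத் _திற"

msgid "Save %s?"
msgstr "%s சேமிக்கவா?"

msgid "Untranslated"
msgstr ""
'''

# Matches single-line msgid/msgstr pairs only; multi-line entries are ignored.
ENTRY_RE = re.compile(r'msgid "(.*)"\s*\nmsgstr "(.*)"')
TAG_RE = re.compile(r'<[^>]+>')                        # inline markup
PLACEHOLDER_RE = re.compile(r'%\d*\$?[sd]|\{\w*\}')    # %s, %d, %1$s, {name}
ACCEL_RE = re.compile(r'[&_](?=\S)')                   # accelerator markers
PUNCT_RE = re.compile(r'([.,!?;:()"\[\]])')            # ASCII punctuation


def extract_pairs(po_text):
    """Return (source, target) string pairs, skipping untranslated entries."""
    return [(src, tgt) for src, tgt in ENTRY_RE.findall(po_text) if src and tgt]


def clean(text):
    """Drop markup, placeholders and accelerator markers; normalise spaces."""
    text = TAG_RE.sub(' ', text)
    text = PLACEHOLDER_RE.sub(' ', text)
    text = ACCEL_RE.sub('', text)
    return ' '.join(text.split())


def tokenise(text):
    """Split on whitespace and separate the listed punctuation marks.

    This deliberately avoids \\w-based tokenising, which in Python's re
    module may split Tamil words at combining vowel signs.
    """
    return PUNCT_RE.sub(r' \1 ', text).split()


def build_corpus(po_text):
    """Return a list of (source_tokens, target_tokens) training pairs."""
    corpus = []
    for src, tgt in extract_pairs(po_text):
        src_tokens = tokenise(clean(src))
        tgt_tokens = tokenise(clean(tgt))
        if src_tokens and tgt_tokens:
            corpus.append((src_tokens, tgt_tokens))
    return corpus


if __name__ == "__main__":
    for src_tokens, tgt_tokens in build_corpus(SAMPLE_PO):
        print(' '.join(src_tokens), '|||', ' '.join(tgt_tokens))
```

Removing placeholders such as %s is only one possible choice. A localisation system may instead need to preserve them so that they survive translation, and the record does not say which approach was taken.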