Malayalam Natural Language Processing: Challenges in Building a Phrase-Based Statistical Machine Translation System

Authors: Mary Priya Sebastian, Santhosh Kumar G.
Year of publication: 2023
Subject:
Source: ACM Transactions on Asian and Low-Resource Language Information Processing. 22:1-51
ISSN: 2375-4702
2375-4699
Description: Statistical Machine Translation (SMT) is a preferred Machine Translation approach that converts text from one language into another by automatically learning translations from a parallel corpus. SMT has produced quality translations for many foreign languages, but only a few works have attempted it for South Indian languages. The article discusses experiments conducted with SMT for the Malayalam language and analyzes how methods defined for SMT in foreign languages affect Malayalam, a Dravidian language. The baseline SMT model does not work for Malayalam because of its unique characteristics, such as its agglutinative nature and morphological richness. Hence, the challenge is to identify precisely where the SMT model has to be modified so that the baseline model accommodates these language peculiarities and gives better English-to-Malayalam translations. The alignments between English and Malayalam sentence pairs, which are subjected to the training process in SMT, play a crucial role in producing quality output translations. Therefore, this work focuses on improving the translation model of SMT by refining the alignments between English–Malayalam sentence pairs. The phrase alignment algorithms align the verb and noun phrases in the sentence pairs and develop a new set of alignments for the English–Malayalam sentence pairs. These alignment sets refine the alignments that GIZA++ produces as a result of the EM training algorithm. The improved Phrase-Based SMT model trained using these refined alignments resulted in better translation quality, as indicated by the AER and BLEU scores.
Database: OpenAIRE
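
Note: The description reports alignment quality as AER (Alignment Error Rate) but does not say how it was computed. The sketch below assumes the commonly used definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of hypothesized links, S the "sure" gold links, and P the "possible" gold links. The function name and the toy links are hypothetical and are not taken from the article.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Return AER for one set of alignment links (lower is better).

    Each link is a (source_index, target_index) pair.
    Assumes the common definition in which sure links count as possible links.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # every sure link is also a possible link
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Made-up toy links (English word index, Malayalam word index); illustration only.
sure = {(0, 0), (1, 2)}
possible = {(2, 1)}

baseline = {(0, 0), (1, 1), (2, 1)}  # one wrong link, one sure link missed
refined = {(0, 0), (1, 2), (2, 1)}   # matches every gold link

aer_baseline = alignment_error_rate(baseline, sure, possible)
aer_refined = alignment_error_rate(refined, sure, possible)
print(aer_baseline, aer_refined)
```

In this toy case the refined link set scores a lower AER than the baseline set, which is the direction of improvement the description reports; the numbers themselves carry no meaning beyond the illustration.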