Malay Manuscripts Transliteration Using Statistical Machine Translation (SMT)
Author: | Muhamad Sadry Abu Seman, Mohammad Noor, Wan Ahmad Syahril Rozli Wan Ali, Noor Hasrul Nizan, Wan Yusoff Wan, Sitti Munirah Abdul Razak |
---|---|
Publication year: | 2019 |
Subject: |
0209 industrial biotechnology; Phrase; Machine translation; business.industry; Computer science; Word error rate; 02 engineering and technology; computer.software_genre; language.human_language; 020901 industrial engineering & automation; 0202 electrical engineering, electronic engineering, information engineering; language; Transliteration; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing; Malay |
Source: | 2019 1st International Conference on Artificial Intelligence and Data Sciences (AiDAS). |
DOI: | 10.1109/aidas47888.2019.8970867 |
Description: | Natural Language Processing (NLP) is a vital field of artificial intelligence that automates the study of human language. However, Malay manuscripts (MM) written in old Jawi have received limited attention in this field. Moreover, most studies relating MM to NLP have focused on rule-based approaches, namely rule-based machine transliteration (RBMT). The objective of this study is therefore to propose a statistical approach for transliterating Malay manuscript content from old Jawi to modern Jawi, using Phrase-Based Statistical Machine Translation (PBSMT) as its model. To this end, a Word Error Rate (WER) quality score was computed on the transliteration output. In addition, the issues formerly encountered by the rule-based approach, such as vowel limitation and homographs, reduplication, letter errors, and combinations of multiple words, were observed in the implementation. The paper used an exploratory approach as its research strategy and a mixed method as its research method. The data for the analysis were extracted from an MM titled Bidāyat al-Mubtadī bi-Fālillah al-Muhdī. The WER quality score was computed to evaluate the SMT output, after which related issues were identified and assessed. The research found that the PBSMT quality score for old-to-modern Jawi transliteration was high in terms of WER, and that the issues of the rule-based approach were generally addressed by PBSMT, except for homographs. The research is limited in that its SMT approach focused solely on PBSMT as its model. Moreover, the corpus was limited to one manuscript, whereas SMT relies on corpus size. Nevertheless, the research contributes to wider coverage of Malay, one of the under-resourced languages in NLP, in the form of old and modern Jawi. To the best of the researchers' knowledge, it is also the first to apply an SMT (PBSMT) approach to old Jawi transliteration. Most importantly, the study aims to contribute to the study of Malay manuscripts. |
Database: | OpenAIRE |
External link: |
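
The description above names Word Error Rate (WER) as the quality score but does not say how it was computed. As a minimal sketch, assuming the conventional definition (word-level edit distance between a reference and a system output, divided by the number of reference words), the calculation could look like the following. The function name `word_error_rate` and the placeholder tokens are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance / number of reference words.

    Assumption: tokens are separated by whitespace. The paper's own
    tokenisation and scoring tool are not described in this record.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Placeholder tokens only, not data from the manuscript:
# one substitution (b -> x) and one deletion (d) over four reference words.
score = word_error_rate("a b c d", "a x c")
```

Under this definition a lower score means fewer word-level errors, and the score can exceed 1.0 when the output contains many extra words.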