Pre-processing English-Hindi Corpus for Statistical Machine Translation

Authors: Karunesh Arora, Shyam Sunder Agrawal
Year of publication: 2018
Subject:
Source: Computación y Sistemas. 21
ISSN: 2007-9737
1405-5546
DOI: 10.13053/cys-21-4-2697
Description: A corpus may be considered the fuel for data-driven approaches to machine translation. Building a parallel corpus is a labour-intensive task, which makes it a costly and scarce resource. The full potential of the available data needs to be exploited, and this can be ensured by removing the different types of inconsistencies faced throughout the NLP domain. This paper describes experiments on corpus text pre-processing for building a baseline Statistical Machine Translation (SMT) system. The pre-processing is divided into two stages: (i) the first handles the orthographic representation of content, and (ii) the second handles non-lexical words. The first stage covers punctuation symbols, casing, and word spellings and their normalization, while the second stage covers the handling of numbers and named entities (NEs), applied on top of the best settings observed in the first stage. The motivation behind these experiments was to derive a relationship and gauge the extent of pre-processing the corpus, thereby building a considerably optimized baseline SMT system. This baseline system would provide a platform for further experiments with different syntactic and semantic factors in future. The findings presented here are for the English-Hindi language pair; however, the concept of pre-processing is language-neutral and can be transferred to any other language pair. The best performance is reported when retaining the punctuation symbols, lower-casing the English corpus, and spell-normalizing the Hindi corpus for English-to-Hindi translation. In the second stage of experiments, the handling of numbers and named entities is described, wherein these are mapped to unique class labels. The impact of these experiments is explained along with their appropriateness for the concerned language pair.
Database: OpenAIRE
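
Illustration: the description above mentions lower-casing the English side, retaining punctuation, and mapping numbers to a class label. A minimal Python sketch of those steps might look like the following. It is not the authors' implementation: the label `<NUM>`, the function names, and the number pattern are assumptions made for illustration, and Hindi spelling normalization and named-entity handling are not covered.

```python
import re

# Hypothetical class label; the record does not state the labels used in the paper.
NUM_LABEL = "<NUM>"

# Runs of ASCII or Devanagari digits, optionally joined by "," or "." separators.
_NUMBER = re.compile(r"[0-9\u0966-\u096F]+(?:[.,][0-9\u0966-\u096F]+)*")


def preprocess_english(sentence: str) -> str:
    """Lower-case the sentence, keep punctuation, map numbers to NUM_LABEL."""
    # Lower-case first so the upper-case label is not altered afterwards.
    return _NUMBER.sub(NUM_LABEL, sentence.lower())


def preprocess_hindi(sentence: str) -> str:
    """Keep punctuation and map numbers to NUM_LABEL (Devanagari has no case)."""
    return _NUMBER.sub(NUM_LABEL, sentence)


if __name__ == "__main__":
    print(preprocess_english("The Price rose by 1,250.50 rupees in 2018."))
    print(preprocess_hindi("कीमत २०१८ में 1,250 रुपये थी।"))
```

Mapping every number to one shared label on both sides of the parallel corpus is one way to reduce data sparsity; the original values would need to be restored after translation, which this sketch does not attempt.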