Pre-processing English-Hindi Corpus for Statistical Machine Translation
Author: | Karunesh Arora, Shyam Sunder Agrawal |
---|---|
Year of publication: | 2018 |
Subject: |
Normalization (statistics); General Computer Science; Machine translation; Computer science; Engineering and technology; Data-driven; Task (project management); Speech-language pathology & audiology; Medical and health sciences; Electrical engineering, electronic engineering, information engineering; Hindi; Class (computer programming); Punctuation; Artificial intelligence & image processing; Artificial intelligence; Other medical science; Word (computer architecture); Natural language processing |
Source: | Computación y Sistemas. 21 |
ISSN: | 2007-9737 1405-5546 |
DOI: | 10.13053/cys-21-4-2697 |
Description: | A corpus may be considered the fuel for data-driven approaches to machine translation. Building a parallel corpus is a labour-intensive task, which makes it a costly and scarce resource. The full potential of the available data needs to be exploited, and this can be ensured by removing the different types of inconsistencies faced throughout the NLP domain. This paper describes experiments on corpus text pre-processing for building a baseline Statistical Machine Translation (SMT) system. The pre-processing is divided into two stages: (i) the first handles the orthographic representation of content, and (ii) the second handles non-lexical words. The first stage covers punctuation symbols, casing, word spellings and their normalization, while the second stage covers the handling of numbers and named entities (NEs), applied on the best settings observed in the first stage. The motivation behind these experiments was to derive a relationship and gauge the extent of pre-processing the corpus needs, thereby building a considerably optimized baseline SMT system. This baseline system would provide a platform for further experiments with different syntactic and semantic factors in future. The findings presented here are for the English-Hindi language pair; however, the concept of pre-processing is language neutral and can be extended to any other language pair. The best performance for English-to-Hindi translation is reported when punctuation symbols are retained, the English corpus is lower-cased and the Hindi corpus is spell-normalized. In the second stage of experiments, numbers and named entities are mapped to unique class labels. The impact of these experiments is explained along with their appropriateness for the language pair concerned. |
Database: | OpenAIRE |
External link: |
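
The two pre-processing stages summarized in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label `<NUM>`, the function names and the number pattern are assumptions made for illustration, and Hindi spelling normalization and named-entity handling are left out because the record gives no details about them.

```python
import re

# Hypothetical class label; the record does not state which labels the authors used.
NUM_LABEL = "<NUM>"

# Assumed number pattern: digit runs with optional "." or "," groups, e.g. 1,200 or 10.30.
# In Python 3, \d also matches Devanagari digits such as १२.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def preprocess_english(sentence: str) -> str:
    """Lower-case the sentence, keep punctuation, and map numbers to a class label."""
    lowered = sentence.lower()  # lower-case first so the label itself keeps its casing
    return _NUMBER.sub(NUM_LABEL, lowered)


def preprocess_hindi(sentence: str) -> str:
    """Keep punctuation and map numbers to a class label (Devanagari has no letter case)."""
    return _NUMBER.sub(NUM_LABEL, sentence)


if __name__ == "__main__":
    print(preprocess_english("The train left at 10.30 with 1,200 Passengers."))
    print(preprocess_hindi("ट्रेन में १,२०० यात्री थे।"))
```

Using the same label on both sides of the parallel corpus means that every number collapses to one token in each language, which is one plausible reading of "mapped to unique class labels"; the original digits would have to be stored separately and restored after translation.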