The application of source language information in Chinese-English statistical machine translation

Author: Zhang, Yuqi
Contributors: Ney, Hermann
Language: English
Year of publication: 2012
Subject:
Source: Aachen : Publikationsserver der RWTH Aachen University VIII, 120 S. : graph. Darst. (2012). = Aachen, Techn. Hochsch., Diss., 2012
Description: Statistical approaches have significantly improved the quality of machine translation (MT), but integrating syntactic knowledge into a statistical MT system remains an open problem. This thesis investigates how syntactic knowledge of the source language can be applied to a phrase-based MT system for translating Chinese into English. Three issues are addressed: the reordering of source sentences using syntactic units (part-of-speech tags, chunks and trees); the treatment and analysis of source-side words left unaligned in the word alignment; and consistent bilingual categorization in pre-processing. In general, the word order of a source language differs from that of the target language. Word reordering, especially long-distance reordering, is a hard task in statistical machine translation. To tackle the reordering problem, this work investigates methods that reduce the number of units to be reordered by forming word groups. Syntactically relevant words are first clustered into syntactic phrases, which are then reordered. The reordering is modeled using different units: part-of-speech (POS) tags, syntactic chunks, and trees. These labeled units are reordered using corresponding reordering rules, which are either learned automatically from training data (POS, chunks) or defined manually (trees). Experiments on corpora of various sizes show that chunk-based reordering works better than the POS-based method, and that tree-based reordering works best on longer sentences. Although the experiments have been performed on Chinese-English translation, chunk-based reordering is also suitable for other languages that lack a good-quality tree parser. In addition, our approaches provide the translation system with multiple reorderings rather than a single one, in order to avoid translation errors caused by incorrect reorderings.
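As a rough illustration of rule-based source reordering that keeps several candidate orders, the following minimal sketch applies tag-pattern rules to a POS-tagged sentence. It is not the thesis implementation: the rule table `RULES`, the function `candidate_reorderings`, the toy tags and the romanized example sentence are all assumptions made for this example, and the thesis learns its POS and chunk rules from training data rather than listing them by hand.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Hypothetical rule table: a tag pattern mapped to a permutation of its positions.
# Here: "preposition + noun + verb" is rewritten as "verb + preposition + noun".
RULES: Dict[Tuple[str, ...], Tuple[int, ...]] = {
    ("P", "NR", "VV"): (2, 0, 1),
}


def candidate_reorderings(
    sentence: List[Token],
    rules: Dict[Tuple[str, ...], Tuple[int, ...]] = RULES,
) -> List[List[str]]:
    """Return the original word order plus one alternative per rule match."""
    words = [word for word, _ in sentence]
    tags = [tag for _, tag in sentence]
    candidates = [words]  # the monotone order is always kept as an option
    for pattern, permutation in rules.items():
        n = len(pattern)
        for start in range(len(tags) - n + 1):
            if tuple(tags[start:start + n]) == pattern:
                reordered = (
                    words[:start]
                    + [words[start + j] for j in permutation]
                    + words[start + n:]
                )
                if reordered not in candidates:
                    candidates.append(reordered)
    return candidates


if __name__ == "__main__":
    # Toy romanized sentence, roughly "I in Beijing work".
    tagged = [("wo", "PN"), ("zai", "P"), ("Beijing", "NR"), ("gongzuo", "VV")]
    for candidate in candidate_reorderings(tagged):
        print(" ".join(candidate))
```

Returning all candidates instead of a single best order mirrors the idea of passing multiple reorderings to the translation system, so that one wrong rule application does not force a wrong input order.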
Another aspect of this thesis is the analysis of unaligned words. Sometimes a word in the source language has no corresponding translation in the target language, which leaves it unaligned in the word alignment. This work argues that these unaligned words cause translation errors such as word deletions and word insertions. To test this hypothesis, the most frequently unaligned words in the source language are either deleted completely (hard deletion) or deleted conditionally (soft deletion). Both approaches improve translation quality. In the pre-processing step of the phrase-based statistical translation system, some words such as dates and numbers are categorized in order to reduce the translation vocabulary. The category rules were built manually and separately for the source and target languages. As a result, modifying the category rules can be very time-consuming, and the effect on the translation output is hard to predict. We have developed a semi-automatic approach that derives the Chinese category rules from the English categories via word alignment. With this approach, a rule change only needs to be introduced manually on the English side, and the Chinese rules are learned automatically. Moreover, this approach makes it easier to adapt the category rules to new domains and new data. The experiments have been carried out on Chinese-English translation tasks of various sizes. The results have been compared against the strong baseline of a state-of-the-art phrase-based translation system. The systems with the reordering methods have been successfully applied in the GALE, NIST and IWSLT evaluations.
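The hard-deletion experiment on unaligned words described above can be pictured with a small sketch. It assumes a word alignment given as a set of (source index, target index) links and ranks source words by how often they are left unaligned; the function names `unaligned_source_counts` and `hard_delete`, the data layout and the toy sentences are hypothetical, and the abstract does not specify the selection criterion or the condition used for soft deletion, so neither is modeled here.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (source index, target index) links


def unaligned_source_counts(
    corpus: Iterable[Tuple[List[str], Alignment]],
) -> Counter:
    """Count how often each source word has no alignment link."""
    counts: Counter = Counter()
    for source, links in corpus:
        aligned_positions = {src for src, _ in links}
        counts.update(
            word for i, word in enumerate(source) if i not in aligned_positions
        )
    return counts


def hard_delete(source: List[str], deletion_list: Set[str]) -> List[str]:
    """Remove every occurrence of the listed words from a source sentence."""
    return [word for word in source if word not in deletion_list]


if __name__ == "__main__":
    # Toy aligned corpus: the particle "de" is never linked to a target word.
    corpus = [
        (["ta", "de", "shu"], {(0, 0), (2, 1)}),
        (["wo", "de", "jia"], {(0, 0), (2, 1)}),
    ]
    counts = unaligned_source_counts(corpus)
    deletion_list = {word for word, _ in counts.most_common(1)}
    print(hard_delete(["ni", "de", "che"], deletion_list))
```

Ranking by raw unaligned counts is the simplest possible choice; a relative frequency (unaligned occurrences divided by total occurrences) would be an equally plausible assumption.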
Database: OpenAIRE