Applying N-gram Alignment Entropy to Improve Feature Decay Algorithms
Author: | Alberto Poncelas, Andy Way, Gideon Maillette de Buy Wenniger |
---|---|
Language: | English |
Year of publication: | 2017 |
Subject: |
Machine translation
Computer science; Engineering and technology; Electrical engineering, electronic engineering, information engineering; Entropy (information theory); Languages & linguistics; Decay factor; Pattern recognition; Humanities and the arts; Ambiguity; Feature Decay Algorithms (FDA); Exponential function; n-gram; Test set; Languages and literature; Computational linguistics. Natural language processing (P98-98.5); Artificial intelligence & image processing; Artificial intelligence; Algorithms; Machine translating; Data selection |
Source: | The Prague Bulletin of Mathematical Linguistics, Vol 108, Iss 1, Pp 245-256 (2017); Poncelas, Alberto (ORCID: 0000-0002-5089-1687) |
ISSN: | 1804-0462 |
Description: | Data selection is a popular step in machine translation pipelines. Feature Decay Algorithms (FDA) is a data selection technique that has shown good performance in several tasks. FDA aims to maximize the coverage of n-grams in the test set. Intuitively, however, more ambiguous n-grams require more training examples for their translation probabilities to be estimated adequately. This ambiguity can be measured by alignment entropy. In this paper we propose two methods for calculating alignment entropies for n-grams of any size, which can be used to improve the performance of FDA. We evaluate substituting the n-gram-specific entropy values computed by these methods for the parameters of both the exponential and linear decay factors of FDA. Experiments on German-to-English and Czech-to-English translation demonstrate that using alignment entropies can increase the quality of FDA's results. |
Database: | OpenAIRE |
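The description mentions alignment entropy as a measure of n-gram ambiguity and a per-n-gram decay factor. The sketch below illustrates both ideas in the simplest form: Shannon entropy over the targets an n-gram is aligned to, and a generic exponential decay of a feature's value each time it is selected. The function names, the toy counts, and the form `init_value * decay ** times_selected` are assumptions made for illustration; the record does not state the paper's exact formulas or its two entropy methods.

```python
import math
from collections import Counter


def alignment_entropy(target_counts):
    """Shannon entropy (in bits) of the targets aligned to one source n-gram.

    target_counts maps each aligned target phrase to how often it was seen.
    """
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in target_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def decayed_value(init_value, decay, times_selected):
    """Illustrative exponential decay (an assumption, not the paper's formula).

    A decay closer to 1 keeps the n-gram valuable for longer, so more
    sentences containing it get selected.
    """
    return init_value * decay ** times_selected


# Hypothetical alignment counts for two source words.
unambiguous = Counter({"house": 8})
ambiguous = Counter({"bank": 4, "bench": 4})

h_low = alignment_entropy(unambiguous)
h_high = alignment_entropy(ambiguous)
```

In this toy setup the word with two equally likely translations has higher entropy than the word with a single translation, which is the signal the description proposes using to decide how slowly an n-gram's value should decay.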