Author: |
Vulic, Ivan, Moens, Marie-Francine |
Contributors: |
Kay, Martin, Boitet, Christian |
Language: |
English |
Year of publication: |
2012 |
Subject: |
|
Description: |
We propose a novel associative approach for bilingual word lexicon extraction (BLE) from parallel corpora that relies on the paradigm of data reduction instead of data augmentation. The key insight of the approach is the effective use of sub-corpora sampling and the properties of low-frequency words in the task of lexicon induction, particularly in a setting where only limited parallel data are available. Word translation pairs are extracted from many smaller sub-corpora (sampled from the original corpus) according to several frequency-based criteria of similarity. We prove the validity of our data sampling approach, and show that this method outperforms IBM Model 1 and associative methods based on similarity scores and hypothesis testing in terms of precision and F-measure in the task of lexicon extraction. Additionally, we show that our sampling-based method can learn correct word translations from fewer data. Published in: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2721-2738. Location: Mumbai, India. Date: 8 Dec - 15 Dec 2012. Status: published |
Database: |
OpenAIRE |
External link: |
|