Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages
Author: | Veysel Yucesoy, Aykut Koc |
---|---|
Year of publication: | 2019 |
Subject: |
Scheme (programming language); General Computer Science; Computer science; Turkish; 05 social sciences; Sentiment analysis; A-weighting; 010501 environmental sciences; 01 natural sciences; Weighting; 0502 economics and business; Selection (linguistics); Artificial intelligence; Computational linguistics; 050203 business & management; Natural language processing; Word (computer architecture); 0105 earth and related environmental sciences |
Source: | ACM Transactions on Asian and Low-Resource Language Information Processing. 18:1-18 |
ISSN: | 2375-4702 2375-4699 |
DOI: | 10.1145/3282443 |
Description: | This study aims to improve the performance of word embeddings by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to compensate for the disadvantage that word pairs which are semantically close, but appear far apart in text, suffer in co-occurrence counting. For high-resource languages, this disadvantage may have little effect because such pairs still co-occur frequently. However, when the available resources are insufficient, such pairs are penalized for being distant. To favour them, a weighting scheme based on a polynomial fitting procedure is proposed that shifts the weights up for distant words while leaving the weights of nearby words almost unchanged. The parameter optimization for the new weights and the effects of the weighting scheme are analysed for the English, Italian, and Turkish languages. A small portion of the English resources and a quarter of the Italian resources are used for demonstration purposes, as if these languages were low-resource languages. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to relatively small corpora (i.e., mimicking low-resource languages) of both English and Italian. To confirm that the benefit is specific to small corpora, it is also shown that, for a large English corpus, the proposed weighting scheme does not outperform the original weights. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed weighting scheme improves performance in both analogy and similarity tests when all Turkish Wikipedia pages are used as the corpus. The positive effect of the proposed scheme is also demonstrated in a standard sentiment analysis task for the Turkish language. |
Database: | OpenAIRE |
External link: |
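
The distance-weighted co-occurrence counting described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give the fitted polynomial or name the original weights, so the 1/d baseline (the weighting commonly used in GloVe-style counting) and the quadratic lift term are assumptions, and the names `harmonic_weight`, `make_lifted_weight`, `cooccurrence_counts`, and the `lift` parameter are hypothetical.

```python
from collections import defaultdict


def harmonic_weight(distance):
    """Assumed baseline: a pair at distance d contributes 1/d."""
    return 1.0 / distance


def make_lifted_weight(window, lift=0.1):
    """Build a hypothetical weight function that raises distant weights.

    The quadratic term is a stand-in for the fitted polynomial mentioned
    in the abstract; its form and the `lift` value are assumptions.
    """
    def weight(distance):
        # t runs from 0 at distance 1 to 1 at the edge of the window.
        t = (distance - 1) / (window - 1) if window > 1 else 0.0
        # Nearby words (small t) keep almost exactly their 1/d weight;
        # words near the window edge receive up to `lift` extra weight.
        return 1.0 / distance + lift * t * t
    return weight


def cooccurrence_counts(tokens, window, weight_fn):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            w = weight_fn(distance)
            counts[(center, tokens[j])] += w
            counts[(tokens[j], center)] += w
    return counts
```

Passing `harmonic_weight` or the result of `make_lifted_weight(window)` as `weight_fn` allows the two schemes to be compared on the same token sequence; the weight function is kept separate from the counting loop so that only the weights change between runs.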