Effects of Positivization on the Paragraph Vector Model
Author: | Aydin Gerek, Mehmet Can Yuney, Murat Can Ganiz, Erencan Erkaya |
---|---|
Year of publication: | 2019 |
Subject: |
Word embedding
Computer science; Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing); Context (language use); 02 engineering and technology; 010501 environmental sciences; Semantics; 01 natural sciences; Field (computer science); Semantic similarity; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Word2vec; Artificial intelligence; Paragraph; Word (computer architecture); Natural language processing; 0105 earth and related environmental sciences |
Source: | INISTA |
Description: | Natural language processing (NLP) is an important field of artificial intelligence. One of the fundamental problems in NLP is to create vector (distributed) representations of words such that the vectors of words with similar meanings lie close together in space. Among the most popular algorithms for creating these representations are word embedding models such as word2vec and fastText. Similarly, the paragraph vector model (doc2vec) creates distributed representations of documents while simultaneously creating distributed representations of the words in those documents. These models produce dense, low-dimensional (usually in the low hundreds of dimensions) vector representations that may include negative values. In this study we focus on these negative values and introduce a family of regularization methods in which the document, word and/or context vectors of the paragraph vector model are forced to have only positive components. We measure the effects on several tasks: text classification, semantic similarity, and analogy. Although positivization greatly increases the sparsity of the word embeddings and would be expected to cause a loss of information, our results show almost no reduction in the performance of the regularized embeddings on these tasks. We also observe an increase in classification accuracy in one case. We foresee that these approaches can be beneficial in machine learning systems that require non-negative vectors. |
Database: | OpenAIRE |
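
The record does not say how the non-negativity constraint is enforced during training, so the following is only a minimal sketch of one possible reading: clipping negative components of an already-trained vector to zero (a projection onto the non-negative orthant). The function names `positivize`, `sparsity`, and `cosine`, and the toy four-dimensional vectors, are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import math


def positivize(vec):
    """Zero out negative components (assumed projection, not the authors' method)."""
    return [max(0.0, x) for x in vec]


def sparsity(vec):
    """Fraction of components that are exactly zero."""
    return sum(1 for x in vec if x == 0.0) / len(vec)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this sketch (real embeddings have hundreds of dimensions).
vec_a = [0.5, -0.2, 0.8, -0.1]
vec_b = [0.4, -0.3, 0.9, 0.2]

vec_a_pos = positivize(vec_a)
vec_b_pos = positivize(vec_b)

print("sparsity before:", sparsity(vec_a), "after:", sparsity(vec_a_pos))
print("cosine before:", cosine(vec_a, vec_b))
print("cosine after: ", cosine(vec_a_pos, vec_b_pos))
```

Comparing the similarity before and after clipping gives a rough feel for the trade-off the description mentions: sparsity goes up, while the similarity structure may or may not be preserved. A regularizer applied during training, as the description suggests the authors use, could behave quite differently from this post-hoc clipping.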