Word Embeddings as Statistical Estimators

Authors:	Dey, Neil; Singer, Matthew; Williams, Jonathan P.; Sengupta, Srijan
Year of publication:	2023
Subject:	Statistics - Methodology
Document type:	Working Paper
Description:	Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical-theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that, under this model, the now-classical Word2Vec method can be interpreted as an estimator of the theoretical pointwise mutual information (PMI). Next, building on the work of Levy and Goldberg (2014), we develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon that of the truncation-based method proposed by Levy and Goldberg (2014). The proposed estimator also performs comparably to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set.
Database:	arXiv
External link:	http://arxiv.org/abs/2301.06710
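
Illustrative note: the description refers to pointwise mutual information (PMI), defined for a word w and context c as PMI(w, c) = log( p(w, c) / (p(w) p(c)) ). The sketch below is not taken from the paper. It shows only the textbook plug-in estimate of PMI from a co-occurrence count matrix, and why unobserved pairs are a problem: a zero count gives an estimate of minus infinity. Clipping at zero (often called positive PMI) is shown as one common form of truncation; whether it matches the exact variant evaluated in the paper is an assumption. The function names and the toy counts are hypothetical, and every word and context is assumed to occur at least once.

```python
import numpy as np


def empirical_pmi(counts):
    """Plug-in PMI estimate from a word-by-context co-occurrence count matrix.

    Assumes every row and every column has at least one nonzero count.
    Entries with a zero count come out as -inf.
    """
    counts = np.asarray(counts, dtype=float)
    p_joint = counts / counts.sum()                 # estimate of p(w, c)
    p_word = p_joint.sum(axis=1, keepdims=True)     # estimate of p(w)
    p_context = p_joint.sum(axis=0, keepdims=True)  # estimate of p(c)
    # log(0) = -inf for unobserved pairs; silence the expected warning.
    with np.errstate(divide="ignore"):
        return np.log(p_joint) - np.log(p_word) - np.log(p_context)


def truncate_at_zero(pmi):
    """Clip PMI at zero (positive PMI); this maps -inf entries to 0."""
    return np.maximum(pmi, 0.0)


# Hypothetical toy counts: 3 words (rows) by 3 contexts (columns).
toy_counts = np.array([
    [4, 0, 1],
    [0, 3, 2],
    [1, 2, 7],
])

pmi = empirical_pmi(toy_counts)
ppmi = truncate_at_zero(pmi)
unobserved = toy_counts == 0  # pairs a missing-value approach would leave unfilled

print(pmi)
print(ppmi)
print(unobserved)
```

Truncation replaces each unobserved entry with a fixed value, whereas the missing value-based approach named in the description treats such entries as missing; the mask `unobserved` only marks where the two treatments would differ and does not implement the paper's estimator.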