Balanced Word Clusters for Interpretable Document Representation
Author: | Dirk Krechel, Marco Wrzalik |
---|---|
Publication year: | 2019 |
Subject: | Normalization (statistics); Word embedding; Computer science; business.industry; Cosine similarity; Pattern recognition; 02 engineering and technology; 010501 environmental sciences; 01 natural sciences; Distance measures; Weighting; ComputingMethodologies_PATTERNRECOGNITION; Similarity (network science); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Word2vec; Artificial intelligence; Cluster analysis; business; tf–idf; Word (computer architecture); 0105 earth and related environmental sciences |
Source: | WML@ICDAR |
Description: | We present Bag-of-Balanced-Concepts (BOBC), a document representation method for fuzzy and interpretable similarity estimation based on word clusters. For this purpose, a k-medoid variant is proposed, which iteratively resamples small clusters to introduce a tendency towards balanced cluster sizes. The necessary inter-word similarities for clustering are computed using GloVe or word2vec word embeddings. In this way, words that often share contexts tend to appear in the same clusters. Those clusters are used to represent documents as normalized probability distributions. Various distance measures acting as document dissimilarity estimators have been evaluated on five datasets. The impact of clustering parameters, input word vectors, and inverse document frequency weighting has been examined in our experiments. Furthermore, a comparison with document similarity estimation baselines has been performed. We demonstrate that, on average, our approach outperforms cosine similarity of both weighted Bag-of-Words vectors (TF-IDF and BM25) and word embedding centroids (Word Centroid Distance). |
Database: | OpenAIRE |
External link: |
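
The description says documents are represented as normalized probability distributions over word clusters and compared with a distance measure. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes a word-to-cluster assignment is already available (the balanced k-medoid clustering itself is not shown), and the names `cluster_distribution`, `cosine_distance`, `word_to_cluster`, and the optional `idf` mapping are hypothetical. Cosine distance is used here as one possible choice; the abstract does not say which distance measure performed best.

```python
import math
from collections import Counter


def cluster_distribution(tokens, word_to_cluster, num_clusters, idf=None):
    """Map a tokenized document to a normalized distribution over word clusters.

    Assumption: word_to_cluster maps each known word to a cluster index in
    [0, num_clusters). Words without a cluster are skipped.
    """
    weights = [0.0] * num_clusters
    for word, count in Counter(tokens).items():
        cluster = word_to_cluster.get(word)
        if cluster is None:
            continue  # out-of-vocabulary word
        # Optional inverse document frequency weighting (hypothetical mapping).
        word_weight = idf.get(word, 1.0) if idf else 1.0
        weights[cluster] += count * word_weight
    total = sum(weights)
    if total == 0.0:
        return weights  # no known words: all-zero vector
    return [w / total for w in weights]


def cosine_distance(p, q):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


# Toy cluster assignment: cluster 0 holds animals, cluster 1 holds vehicles.
word_to_cluster = {"cat": 0, "dog": 0, "car": 1, "truck": 1}

doc_a = cluster_distribution(["cat", "car", "car"], word_to_cluster, 2)
doc_b = cluster_distribution(["dog", "truck", "truck"], word_to_cluster, 2)
distance = cosine_distance(doc_a, doc_b)
```

In this toy example the two documents share no words, yet they map to the same cluster distribution, so their distance is close to zero. This illustrates the "fuzzy" matching the description refers to, and each dimension stays interpretable because it corresponds to a readable group of words.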