Balanced Word Clusters for Interpretable Document Representation

Authors: Dirk Krechel, Marco Wrzalik
Year of publication: 2019
Subject:
Source: WML@ICDAR
Description: We present Bag-of-Balanced-Concepts (BOBC), a document representation method for fuzzy and interpretable similarity estimation based on word clusters. For this purpose, a k-medoid variant is proposed, which iteratively resamples small clusters to introduce a tendency towards balanced cluster sizes. The necessary inter-word similarities for clustering are computed using GloVe or word2vec word embeddings. In this way, words that often share contexts tend to appear in the same clusters. Those clusters are used to represent documents as normalized probability distributions. Various distance measures acting as document dissimilarity estimators have been evaluated on five datasets. The impact of clustering parameters, input word vectors, and inverse document frequency weighting has been examined in our experiments. Furthermore, a comparison with document similarity estimation baselines has been performed. We demonstrate that, on average, our approach outperforms cosine similarity of both weighted Bag-of-Words vectors (TF-IDF and BM25) and word embedding centroids (Word Centroid Distance).
Database: OpenAIRE
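
Illustrative sketch (not the authors' implementation): the description says that word clusters are used to represent documents as normalized probability distributions, which are then compared with a distance measure. The Python below shows one way this step could look, assuming the clustering has already produced a word-to-cluster mapping. The function names `bobc_vector` and `jensen_shannon`, the toy mapping, the optional `idf` dictionary, the choice to skip unknown words, and the use of Jensen-Shannon divergence as the distance are all assumptions made for illustration; the record does not state which distance measures performed best.

```python
import math


def bobc_vector(tokens, word_to_cluster, num_clusters, idf=None):
    """Map a tokenized document to a normalized distribution over word clusters.

    Hypothetical helper: unknown words are skipped, and each known word adds
    its IDF weight (or 1.0 when no IDF table is given) to its cluster's bin.
    """
    weights = [0.0] * num_clusters
    for token in tokens:
        cluster = word_to_cluster.get(token)
        if cluster is None:
            continue  # assumption: out-of-vocabulary words are ignored
        weights[cluster] += idf.get(token, 1.0) if idf else 1.0
    total = sum(weights)
    # Normalize so the bins sum to 1; an empty document stays all zeros.
    return [w / total for w in weights] if total > 0 else weights


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy cluster assignment standing in for the output of the clustering step.
word_to_cluster = {
    "cat": 0, "dog": 0,      # animals
    "car": 1, "truck": 1,    # vehicles
    "bank": 2, "money": 2,   # finance
}

doc_a = bobc_vector(["cat", "dog", "car"], word_to_cluster, 3)
doc_b = bobc_vector(["dog", "cat", "truck"], word_to_cluster, 3)
doc_c = bobc_vector(["bank", "money"], word_to_cluster, 3)

dist_ab = jensen_shannon(doc_a, doc_b)
dist_ac = jensen_shannon(doc_a, doc_c)
```

In this toy setup, `doc_a` and `doc_b` contain different vehicle words but fall into the same cluster bins, so their divergence is zero, while `doc_c` shares no cluster with `doc_a` and reaches the maximum divergence. This is the sense in which a cluster-based representation gives a fuzzy match, and each bin can be inspected by listing the words in its cluster.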