A self-training approach for short text clustering

Author:	Lucas Sterckx, Amir Hadifar, Chris Develder, Thomas Demeester
Subject:	business.industry; Computer science; 02 engineering and technology; 010501 environmental sciences; Document clustering; Machine learning; computer.software_genre; 01 natural sciences; Autoencoder; ComputingMethodologies_PATTERNRECOGNITION; Discriminative model; 0202 electrical engineering, electronic engineering, information engineering; Embedding; 020201 artificial intelligence & image processing; Artificial intelligence; business; Cluster analysis; Encoder; Self training; computer; Sentence; 0105 earth and related environmental sciences
Source:	Ghent University Academic Bibliography; RepL4NLP@ACL
Description:	Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations for short texts. Low-dimensional continuous representations, or embeddings, can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update the weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::6c0b9a38606f3a24d20cab017232506d https://biblio.ugent.be/publication/8621468
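
The description above says that cluster assignments serve as supervision for updating the encoder, but it does not spell out the objective. The following is a minimal sketch of one common way to set up such a self-training signal, borrowed from Deep Embedded Clustering-style methods; it is an assumption for illustration, not necessarily the authors' formulation. The function names (`soft_assign`, `target_distribution`, `kl_divergence`), the Student's t kernel, and the random stand-in data are all hypothetical.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Soft cluster assignments from a Student's t kernel (assumed choice)."""
    # Squared Euclidean distance between every point and every centroid: (n, k)
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so it is a probability distribution over clusters
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary targets: square q and divide by soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all points; a training loop would minimize this."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Stand-in data: in practice z would be encoder outputs for sentence
# embeddings, and the centroids would come from a clustering algorithm.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))
centroids = z[:2].copy()

q = soft_assign(z, centroids)   # current soft assignments
p = target_distribution(q)      # targets used as supervision
loss = kl_divergence(p, q)      # signal for updating the encoder weights
```

In a full pipeline, the gradient of `loss` with respect to the encoder parameters would be computed by an autodiff framework, with `p` recomputed periodically and held fixed between updates; that part is omitted here.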