Fine-Tuning Language Models For Semi-Supervised Text Mining

Author: Chen Xinyu, Cynthia Freeman, Ian Beaver
Year of publication: 2020
Subject:
Source: IEEE BigData
DOI: 10.1109/bigdata50022.2020.9377810
Description: Traditional text representations are high-dimensional, but the underlying text data is sparse, which makes text clustering a very challenging task. Language models and deep contextualized representations have shown promise in many Natural Language Processing (NLP) tasks. However, some task-specific guidance is necessary to adapt language models to a novel domain or to particular downstream tasks. We present an empirical study of a pipeline for semi-supervised text clustering tasks. Our proposed method uses a small number of labeled samples to fine-tune pre-trained language models. This fine-tuning step adapts the language models to produce task-specific contextualized representations, improving the performance of downstream text clustering tasks. We evaluate two clustering algorithms using the output of three different language models on six real-world text mining tasks to measure how much this pipeline can improve text clustering accuracy and how many labeled samples are needed for improvement. Our experiments show that for topic mining in novel domains or surfacing the intentions of abstracts, language models begin to produce better task-specific representations using a labeled subset as small as 0.5% of the task data. On the other hand, to find topics in domains that overlap with the pre-training corpora, language models need labeled subsets closer to 1.0% of the task data to overcome the catastrophic forgetting problem. Further experiments show that the downstream clustering accuracy gain begins to slow down or plateau if language models are fine-tuned with more than 5% of the task data. There is a trade-off between the desired downstream clustering quality and the cost of labeling and fine-tuning language models. Fine-tuned with 2.5% of the task data, our approach matches or exceeds the current state-of-the-art for several clustering tasks and provides baseline results for two novel clustering tasks. These results give text clustering and information retrieval practitioners solid guidance on using powerful language models.
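The description refers to labeled subsets of 0.5%, 1.0%, 2.5%, and 5% of the task data and to downstream clustering accuracy. The following is a minimal sketch of those two ingredients only, not the authors' code: the fine-tuning and clustering steps are omitted. The helper names `sample_labeled_subset` and `clustering_accuracy` are hypothetical, uniform random sampling is an assumption, and so is the accuracy definition used here (the best one-to-one mapping of cluster ids to labels), since this record does not specify how the subsets were drawn or how accuracy was computed.

```python
import random
from collections import Counter
from itertools import permutations


def sample_labeled_subset(labels, fraction, seed=0):
    """Return sorted indices of a uniform random subset of the data.

    Hypothetical helper: the subset covers `fraction` of the items,
    with a minimum of one index. Uniform sampling is an assumption.
    """
    n = max(1, round(len(labels) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(labels)), n))


def clustering_accuracy(true_labels, cluster_ids):
    """Best accuracy over one-to-one mappings of cluster ids to labels.

    Assumed metric definition. Brute force over permutations, so it is
    only practical for a handful of clusters.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes; not handled in this sketch")
    counts = Counter(zip(cluster_ids, true_labels))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        hits = sum(counts[(c, y)] for c, y in zip(clusters, assigned))
        best = max(best, hits)
    return best / len(true_labels)


if __name__ == "__main__":
    # Toy data: 1000 items with three made-up topic labels.
    toy_labels = [i % 3 for i in range(1000)]
    for fraction in (0.005, 0.01, 0.025, 0.05):
        subset = sample_labeled_subset(toy_labels, fraction)
        print(f"{fraction:.1%} of the data -> {len(subset)} labeled items")

    # Cluster ids are arbitrary, so they are matched to labels before scoring.
    print(clustering_accuracy(["a", "a", "b", "b", "c", "c"], [1, 1, 0, 0, 2, 0]))
```

For more than a few clusters, the permutation search would normally be replaced by an assignment solver; the brute-force version is used here only to keep the sketch dependency-free.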
Database: OpenAIRE