Fine-Tuning Language Models For Semi-Supervised Text Mining
Author: | Xinyu Chen, Cynthia Freeman, Ian Beaver |
---|---|
Year of publication: | 2020 |
Subject: | Context model, Computer science, Document clustering, Data modeling, Task analysis, Text mining, Information systems, Language model, Artificial intelligence, Cluster analysis, Natural language processing |
Source: | IEEE BigData |
DOI: | 10.1109/bigdata50022.2020.9377810 |
Description: | Traditional text representations are high-dimensional while the underlying text data is sparse, which makes text clustering a very challenging task. Language models and deep contextualized representations have shown promise in many Natural Language Processing (NLP) tasks. However, some task-specific guidance is necessary to adapt language models to a novel domain or to particular downstream tasks. We present an empirical study of a pipeline for semi-supervised text clustering tasks. Our proposed method utilizes a small number of labeled samples to fine-tune pre-trained language models. This fine-tuning step adapts the language models to produce task-specific contextualized representations, improving the performance of downstream text clustering. We evaluate two clustering algorithms on the output of three different language models across six real-world text mining tasks to demonstrate to what extent this pipeline can improve text clustering accuracy and how many labeled samples are needed for the improvement. Our experiments show that for topic mining in novel domains or for surfacing the intentions of abstracts, language models begin to produce better task-specific representations with a labeled subset as small as 0.5% of the task data. On the other hand, to find topics in domains that overlap with the pre-training corpora, language models need labeled subsets closer to 1.0% of the task data to overcome the catastrophic forgetting problem. Further experiments show that the gain in downstream clustering accuracy slows down or plateaus once language models are fine-tuned with more than 5% of the task data. There is thus a trade-off between the desired downstream clustering quality and the cost of labeling data and fine-tuning language models. Fine-tuned with 2.5% of the task data, our approach matches or exceeds the current state of the art for several clustering tasks and provides baseline results for two novel clustering tasks. These results provide solid guidance for text clustering and information retrieval practitioners on how to utilize powerful language models. (A minimal illustrative sketch of the fine-tune-then-cluster pipeline follows this record.) |
Database: | OpenAIRE |
External link: |
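
The record describes the pipeline only in prose, so a minimal sketch may help make it concrete: fine-tune a pre-trained language model on a small labeled subset of the task data, then embed the full corpus with the adapted encoder and cluster the embeddings. The sketch below is not the paper's implementation; it assumes Hugging Face `transformers` with a `bert-base-uncased` checkpoint, k-means from scikit-learn, and illustrative function names (`fine_tune`, `embed_and_cluster`) and hyperparameters that are not taken from the paper, whereas the authors evaluate three language models and two clustering algorithms.

```python
# A sketch of the fine-tune-then-cluster pipeline, not the paper's implementation.
# Assumptions: bert-base-uncased, a cross-entropy classification head for
# fine-tuning, [CLS]-token embeddings, and k-means clustering.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.cluster import KMeans

MODEL_NAME = "bert-base-uncased"  # assumed checkpoint, not specified by this record

def fine_tune(labeled_texts, labels, num_labels, epochs=3, lr=2e-5):
    """Adapt a pre-trained LM with a small labeled subset (e.g. 0.5-5% of the task data)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
    enc = tokenizer(labeled_texts, padding=True, truncation=True, return_tensors="pt")
    loader = DataLoader(
        TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)),
        batch_size=16, shuffle=True,
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, y in loader:
            loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=y).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return tokenizer, model

def embed_and_cluster(tokenizer, model, all_texts, n_clusters, batch_size=32):
    """Embed the full (mostly unlabeled) corpus with the adapted encoder, then cluster it."""
    encoder = model.base_model  # drop the classification head, keep the fine-tuned encoder
    encoder.eval()
    chunks = []
    with torch.no_grad():
        for i in range(0, len(all_texts), batch_size):
            batch = tokenizer(all_texts[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            chunks.append(encoder(**batch).last_hidden_state[:, 0, :])  # [CLS] vector per document
    embeddings = torch.cat(chunks).numpy()
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
```

In this reading, the fraction of labeled data passed to `fine_tune` is the main knob: per the description above, labeled subsets of roughly 0.5% to 2.5% of the task data already improve the task-specific representations, while gains beyond 5% slow down or plateau.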