Improving Intent Classification Using Unlabeled Data from Large Corpora

Authors: Costin-Gabriel Chiru, Traian Rebedea, Gabriel Bercaru, Ciprian-Octavian Truică
Year of publication: 2023
Subject:
Source: Mathematics
Volume 11
Issue 3
Pages: 769
ISSN: 2227-7390
Description: Intent classification is a central component of a Natural Language Understanding (NLU) pipeline for conversational agents. The quality of such a component depends on the quality of the training data; however, for many conversational scenarios the data may be scarce, and in these scenarios data augmentation techniques are used. Having general data augmentation methods that can generalize to many datasets is highly desirable. The work presented in this paper is centered around two main components. First, we explore the influence of various feature vectors on the task of intent classification using RASA’s text classification capabilities. The second part of this work consists of a generic method for efficiently augmenting textual corpora using large datasets of unlabeled data. The proposed method is able to efficiently mine for examples similar to the ones that are already present in standard, natural language corpora. The experimental results show that using our corpus augmentation methods enables an increase in text classification accuracy in few-shot settings. In particular, the gains in accuracy reach up to 16% when the number of labeled examples is very low (e.g., two examples). We believe that our method is important for any Natural Language Processing (NLP) or NLU task in which labeled training data are scarce or expensive to obtain. Lastly, we give some insights into future work, which aims at combining our proposed method with a semi-supervised learning approach.
Database: OpenAIRE