Pre-trained Language Models with Limited Data for Intent Classification

Authors: Gour Karmakar, Madhu Chetty, Darren Walls, Buddhika Kasthuriarachchy
Publication year: 2020
Subject:
Source: IJCNN
DOI: 10.1109/ijcnn48605.2020.9207121
Description: Intent analysis is capturing the attention of both industry and academia due to its commercial and noncommercial significance. Micro-blogging platforms such as Twitter and Facebook, with their rapidly growing volumes of unstructured data, are among the most important sources for intent analysis. However, social media data are often noisy and diverse, making the task very challenging. Further, intent analysis frequently suffers from a lack of sufficient data because labeled datasets must often be manually annotated. Recently, BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language representation model, has attracted attention for accurate language modelling. In this paper, we investigate the suitability of BERT for intent analysis. We study fine-tuning of the BERT model through inductive transfer learning and address the challenges of limited data availability by proposing a novel semantic data augmentation approach. This technique generates synthetic sentences that preserve label compatibility by exploiting the semantic meaning of the sentences, thereby improving intent classification accuracy. Based on these considerations for fine-tuning and data augmentation, a systematic, step-by-step methodology is presented for applying the BERT language model to intent classification with limited available data. Our results show that the pre-trained language model can be used effectively with noisy social media data to achieve state-of-the-art accuracy in intent analysis under a low labeled-data regime. Moreover, our results confirm that the proposed text augmentation technique is effective in eliminating noisy synthetic sentences, thereby achieving further performance improvements.
Database: OpenAIRE
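
The description mentions filtering out noisy synthetic sentences by checking that augmented text stays semantically close to the original, so that the intent label remains valid. Below is a minimal, illustrative sketch of that general idea: candidate sentences from a toy generator are scored against the original with cosine similarity over mean-pooled BERT embeddings, and only high-similarity candidates are kept. The model name, similarity threshold, and word-dropout generator are assumptions for illustration, not the authors' exact method.

```python
# Hedged sketch of similarity-filtered data augmentation for intent classification.
# Assumptions (not from the paper): bert-base-uncased encoder, mean pooling,
# cosine-similarity threshold of 0.90, and a toy word-dropout candidate generator.
import random

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed encoder; any BERT variant could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()


def embed(sentences):
    """Return mean-pooled BERT token embeddings as sentence vectors."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts                                # (B, H)


def generate_candidates(sentence, n=5, drop_prob=0.15):
    """Toy generator (random word dropout); a stand-in for any augmentation source."""
    words = sentence.split()
    candidates = []
    for _ in range(n):
        kept = [w for w in words if random.random() > drop_prob]
        if kept:
            candidates.append(" ".join(kept))
    return candidates


def augment(sentence, label, threshold=0.90):
    """Keep only candidates whose embedding stays close to the original sentence,
    on the assumption that high semantic similarity preserves the intent label."""
    candidates = generate_candidates(sentence)
    if not candidates:
        return []
    vecs = embed([sentence] + candidates)
    sims = torch.nn.functional.cosine_similarity(vecs[0:1], vecs[1:])
    return [(c, label) for c, s in zip(candidates, sims.tolist()) if s >= threshold]


if __name__ == "__main__":
    # Hypothetical intent example; label name is illustrative only.
    print(augment("book me a flight from boston to denver tomorrow", "BookFlight"))
```

The filtered pairs could then be appended to the small labeled training set before fine-tuning a BERT classifier; the threshold trades off diversity of synthetic data against the risk of label drift.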