Pre-trained Language Models with Limited Data for Intent Classification
Author: Gour Karmakar, Madhu Chetty, Darren Walls, Buddhika Kasthuriarachchy
Year of publication: 2020
Subject: Computer science; Natural language processing; Language model; Transformer (machine learning model); Unstructured data; Semantics; Semantic data model; Data modeling; Inductive transfer; Task analysis; Social media; Artificial intelligence
Source: IJCNN
DOI: 10.1109/ijcnn48605.2020.9207121
Description: Intent analysis is attracting attention from both industry and academia because of its commercial and non-commercial significance. The rapidly growing volume of unstructured data on micro-blogging platforms such as Twitter and Facebook is among the most important sources for intent analysis. However, social media data are often noisy and diverse, which makes the task very challenging. Furthermore, intent analysis frequently suffers from a lack of sufficient data because labeled datasets are usually annotated manually. Recently, BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language representation model, has attracted attention for accurate language modelling. In this paper, we investigate the suitability of BERT for intent analysis. We study fine-tuning of the BERT model through inductive transfer learning and propose a novel semantic data augmentation approach to overcome the challenges posed by limited data availability. This technique generates synthetic sentences while preserving label compatibility, using the semantic meaning of the sentences, to improve intent classification accuracy. Based on these considerations for fine-tuning and data augmentation, a systematic and novel step-by-step methodology is presented for applying the BERT language model to intent classification with limited available data. Our results show that the pre-trained language model can be used effectively on noisy social media data to achieve state-of-the-art accuracy in intent analysis under a low labeled-data regime. Moreover, our results also confirm that the proposed text augmentation technique is effective in eliminating noisy synthetic sentences, thereby yielding further performance improvements. (A minimal illustrative sketch of this workflow follows the record below.)
Database: OpenAIRE
External link:
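
A minimal sketch of the workflow the abstract describes, assuming a Hugging Face `transformers` / PyTorch setup: synthetic sentences are kept only if their mean-pooled BERT embedding stays close to the original sentence (an assumed stand-in for the paper's label-compatibility check, since the abstract does not spell out the exact mechanism), and BERT is then fine-tuned on the labelled set. The model names, the 0.85 similarity threshold, and the toy data are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: semantic filtering of augmented sentences + BERT fine-tuning
# for intent classification. Threshold, model names, and toy data are
# illustrative assumptions, not the paper's reported setup.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification, BertModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# --- Semantic filter over synthetic sentences (assumed approach) ---
encoder = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

def embed(sentences):
    """Mean-pooled BERT embeddings used as a sentence-level semantic signal."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)               # (B, H)

def filter_augmented(original, candidates, threshold=0.85):
    """Keep synthetic variants whose embedding stays close to the original,
    assuming high similarity preserves the intent label."""
    sims = torch.nn.functional.cosine_similarity(embed([original]), embed(candidates))
    return [c for c, s in zip(candidates, sims.tolist()) if s >= threshold]

# --- Fine-tuning BERT on the (augmented) labelled set ---
texts  = ["book a flight to paris", "reserve a table for two", "play some jazz music"]
labels = [0, 1, 2]   # toy intent labels; real data would come from the labelled corpus

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader  = DataLoader(dataset, batch_size=8, shuffle=True)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3).to(device)
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        optim.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=y.to(device))
        out.loss.backward()
        optim.step()
```

In a low-resource setting, sentences accepted by `filter_augmented` would be appended to `texts` and `labels` before fine-tuning; the similarity threshold trades off the diversity of the augmented data against the risk of label drift from noisy synthetic sentences.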