Detecting Urgency Status of Crisis Tweets: A Transfer Learning Approach for Low Resource Languages
Author: | Bohan Qu, Efsun Sarioglu Kayi, Mona Diab, Linyong Nan, Kathleen R. McKeown |
---|---|
Year of publication: | 2020 |
Subject: | Low resource, Computer science, Deep learning, Classifier, Labeled data, Artificial intelligence, Transfer learning, Natural language processing |
Source: | COLING |
Description: | We release an urgency dataset that consists of English tweets relating to natural crises, along with annotations of their corresponding urgency status. Additionally, we release evaluation datasets for two low-resource languages, Sinhala and Odia, and demonstrate an effective zero-shot transfer from English to these two languages by training cross-lingual classifiers. We adopt cross-lingual embeddings constructed using different methods to extract features of the tweets, including state-of-the-art contextual embeddings such as BERT, RoBERTa, and XLM-R. We train classifiers of different architectures on the extracted features. We also explore semi-supervised approaches by utilizing unlabeled tweets and experiment with ensembling different classifiers. With very limited amounts of labeled data in English and zero labeled data in the low-resource languages, we show a successful framework for training monolingual and cross-lingual classifiers using deep learning methods, which are known to be data-hungry. Specifically, we show that recent deep contextual embeddings are also helpful when dealing with very small-scale datasets. Classifiers that incorporate RoBERTa yield the best performance on the English urgency detection task, with F1 scores more than 25 points above our baseline classifier. For the zero-shot transfer to low-resource languages, classifiers that use LASER features perform best for the Sinhala transfer, while XLM-R features benefit the Odia transfer the most. |
Database: | OpenAIRE |
External link: |
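The zero-shot setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a shared cross-lingual embedding space (in the paper, features come from LASER, XLM-R, etc.), which is simulated here with synthetic vectors so the example is self-contained. A classifier is trained only on "English" feature vectors and evaluated directly on "Sinhala" vectors from the same space, with no target-language labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n, dim=16):
    """Simulate urgent (label 1) vs. non-urgent (label 0) tweet embeddings.

    Stand-in for real cross-lingual sentence embeddings: both "languages"
    are drawn from the same shared feature space.
    """
    urgent = rng.normal(loc=1.0, scale=1.0, size=(n // 2, dim))
    calm = rng.normal(loc=-1.0, scale=1.0, size=(n // 2, dim))
    X = np.vstack([urgent, calm])
    y = np.array([1] * (n // 2) + [0] * (n // 2))
    return X, y

def train_logreg(X, y, lr=0.1, epochs=200):
    """Plain logistic regression by gradient descent (no external deps)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0) == y))

# Train on "English" labeled data only; evaluate zero-shot on "Sinhala"
# test data that lives in the same (simulated) cross-lingual space.
X_en, y_en = make_split(200)
X_si, y_si = make_split(100)
w, b = train_logreg(X_en, y_en)
acc = accuracy(w, b, X_si, y_si)
```

Because the two languages share one embedding space, the decision boundary learned on English transfers directly; in the paper this same property is what lets classifiers trained on English tweets score Sinhala and Odia tweets with zero target-language labels.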