Arabic duplicate questions detection based on contextual representation, class label matching, and structured self attention

Authors: Alami Hamza, Said El Alaoui Ouatik, Khalid Alaoui Zidani, Noureddine En-Nahnahi
Language: English
Publication year: 2022
Subject:
Source: Journal of King Saud University: Computer and Information Sciences, Vol 34, Iss 6, Pp 3758-3765 (2022)
Document type: article
ISSN: 1319-1578
44344759
DOI: 10.1016/j.jksuci.2020.11.032
Description: Question Answering Systems (QAS) are an emerging class of solutions that provide exact and precise answers to natural-language questions. Duplicate Question Detection (DQD), which aims to reuse previous answers, has been shown to improve user experience and significantly reduce response time. However, few Arabic QAS integrate duplicate question detection into their workflow. In this paper, we build a DQD method based on contextual word representations, question classification, and forward/backward structured self-attention. First, we use Embeddings from Language Models (ELMo) to extract contextual word representations that map questions into a vector space. Next, we train two models to classify question embeddings according to two taxonomies: Hamza et al. and Li & Roth. Then, we introduce a class label matching step to filter out question pairs whose questions have different class labels. Finally, we propose a Bidirectional Attention Bidirectional LSTM (BiAttention BiLSTM) model that focuses only on keywords to predict whether a question pair is a duplicate or not. We also apply a data augmentation strategy based on symmetry, reflexivity, and transitivity relations to improve the generalization of our model. Various experiments are performed to evaluate the impact of question classification and of the pre-processing step on the DQD model. The results show that our model performs well compared with the baseline results.
Database: Directory of Open Access Journals
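
The data augmentation strategy mentioned in the description (symmetry, reflexivity, and transitivity over duplicate pairs) can be illustrated with a minimal Python sketch. The abstract does not give the authors' actual procedure, so everything here is an assumption: the function name `augment_pairs` is hypothetical, questions are represented as plain strings, and only positive (duplicate) pairs are expanded.

```python
from itertools import combinations


def augment_pairs(duplicate_pairs):
    """Expand known duplicate pairs via symmetry, reflexivity, and transitivity.

    Hypothetical helper, not the authors' implementation.
    """
    parent = {}  # union-find forest: question -> representative

    def find(q):
        parent.setdefault(q, q)
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q

    # Merge questions that are labeled as duplicates of each other.
    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    # Collect each group of mutually duplicate questions.
    groups = {}
    for q in list(parent):
        groups.setdefault(find(q), []).append(q)

    augmented = set()
    for members in groups.values():
        for q in members:
            augmented.add((q, q))          # reflexivity: q duplicates itself
        for a, b in combinations(members, 2):
            augmented.add((a, b))          # transitivity: same group
            augmented.add((b, a))          # symmetry: order does not matter
    return augmented


pairs = [("q1", "q2"), ("q2", "q3")]
augmented = augment_pairs(pairs)
```

In this sketch, the two labeled pairs imply that `q1` and `q3` are also duplicates, so the pair `("q1", "q3")` and its mirror are added to the training set. How non-duplicate pairs should be augmented is left open, since the description does not specify it.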