ViReader: A Wikipedia-based Vietnamese reading comprehension system using transfer learning
Authors: | Nhat Duy Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen, Kiet Van Nguyen, Phong Nguyen-Thuan Do |
---|---|
Year of publication: | 2021 |
Subject: | Statistics and Probability; Computer science; Vietnamese; General Engineering; 02 engineering and technology; language.human_language; Linguistics; Reading comprehension; Artificial Intelligence; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; language; 020201 artificial intelligence & image processing; Transfer of learning |
Source: | Journal of Intelligent & Fuzzy Systems. 41:1993-2011 |
ISSN: | 1875-8967 1064-1246 |
Description: | Machine reading comprehension has attracted significant interest in natural language understanding research, and large-scale datasets and neural network-based methods have been developed for the task. However, most resources and methods for machine reading comprehension have been developed for two resource-rich languages, English and Chinese. This article proposes ViReader, a system for open-domain machine reading comprehension in Vietnamese that uses Wikipedia as its textual knowledge source: the answer to any question is a text span taken directly from Vietnamese Wikipedia. Our system combines a sentence retriever, which uses information retrieval techniques to extract relevant sentences, with a transfer learning-based answer extractor trained to predict answers from Wikipedia texts. Experiments on multiple machine reading comprehension datasets in Vietnamese and other languages demonstrate that (1) ViReader is highly competitive with prevalent machine learning-based systems, and (2) combining the sentence retriever and the answer extractor through multi-task learning yields an end-to-end reading comprehension system. The sentence retriever returns the sentences most likely to contain the answer to the given question. The answer extractor then reads the document from which those sentences were retrieved, predicts the answer, and returns it to the user. ViReader achieves new state-of-the-art performance, with 70.83% EM (exact match) and 89.54% F1, outperforming the BERT-based system by 11.55% and 9.54%, respectively. It also obtains state-of-the-art performance on UIT-ViNewsQA (another Vietnamese dataset, consisting of online health-domain news) and BiPaR (a bilingual dataset of English and Chinese novel texts). Compared with the BERT-based system, our system improves F1 on BiPaR by 7.65% for English and 6.13% for Chinese. Furthermore, we build a ViReader application programming interface that programmers can employ in artificial intelligence applications. |
Database: | OpenAIRE |
External link: |
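
The abstract describes a sentence retriever that ranks sentences for a question and reports results as EM (exact match) and F1. The following is a minimal Python sketch of those two ideas, not the authors' implementation. It assumes TF-IDF cosine ranking for the retriever (the abstract only says "techniques of information retrieval") and the token-overlap F1 commonly used for span-extraction benchmarks. The function names `tokenize`, `retrieve`, `exact_match`, and `f1_score` are hypothetical, and the transfer learning-based answer extractor is not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplification: lowercase and split on word characters.
    # Real Vietnamese text would normally go through a word segmenter.
    return re.findall(r"\w+", text.lower())


def retrieve(question, sentences, k=1):
    """Return the k sentences ranked highest by TF-IDF cosine similarity.

    Assumption: TF-IDF stands in for whatever retrieval method the paper uses.
    """
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1.0 for t in df}
    q = Counter(tokenize(question))

    def score(d):
        # The question norm is constant across sentences, so it is omitted.
        dot = sum(q[t] * d[t] * idf[t] ** 2 for t in q if t in d)
        norm = math.sqrt(sum((c * idf[t]) ** 2 for t, c in d.items()))
        return dot / norm if norm else 0.0

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [sentences[i] for i in ranked[:k]]


def exact_match(prediction, gold):
    """1.0 if the token sequences are identical after normalization, else 0.0."""
    return float(tokenize(prediction) == tokenize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = tokenize(prediction), tokenize(gold)
    if not pred or not ref:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentences = [
        "Hanoi is the capital of Vietnam.",
        "The Mekong River flows through several countries.",
        "Pho is a Vietnamese noodle soup.",
    ]
    print(retrieve("What is the capital of Vietnam?", sentences))
    print(exact_match("Hanoi", "hanoi"), f1_score("the capital Hanoi", "Hanoi"))
```

In a full system of the kind the abstract outlines, the retrieved sentences would be passed to a trained answer extractor, and EM and F1 would be averaged over every question in a test set to produce percentages like those reported.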