Unsupervised question answering data acquisition from local corpora

Authors: Lucian Vlad Lita, Jaime G. Carbonell
Year of publication: 2004
Subject:
Source: CIKM
DOI: 10.1145/1031171.1031283
Description: Data-driven approaches in question answering (QA) are increasingly common. Because training data for such approaches is very limited, we propose an unsupervised algorithm that generates high-quality question-answer pairs from local corpora. The algorithm is ontology-independent and requires only a very small set of seed data as its starting point. Two alternating views of the data make learning possible: 1) question types are viewed as relations between entities and 2) question types are described by their corresponding question-answer pairs. These two aspects of the data allow us to construct an unsupervised algorithm that acquires high-precision question-answer pairs. We show the quality of the acquired data for different question types and perform a task-based evaluation. With each iteration, pairs acquired by the unsupervised algorithm are used as training data for a simple QA system. Performance increases with the number of question-answer pairs acquired, confirming the robustness of the unsupervised algorithm. We introduce the notion of semantic drift and show that it is a desirable quality in training data for question answering systems.
Database: OpenAIRE
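
Illustrative note: the abstract describes alternating between two views of the data, a question type as a relation between entities and a question type as a set of question-answer pairs. The record does not give the algorithm itself, so the following is only a minimal sketch of a generic seed-based bootstrapping loop under stated assumptions, not the authors' method. The function names (`learn_patterns`, `apply_patterns`, `bootstrap`), the toy corpus, the 40-character context window, and the crude capitalized-word/four-digit-year entity matcher are all hypothetical choices made for illustration.

```python
import re

# Hypothetical toy corpus and seed pair; not taken from the paper.
corpus = [
    "Mozart was born in 1756 in Salzburg.",
    "Einstein was born in 1879 in Ulm.",
    "Darwin was born in 1809.",
]
seeds = {("Mozart", "1756")}


def learn_patterns(corpus, pairs):
    """Pairs -> relation view: collect the text that links the two members of a known pair."""
    patterns = set()
    for sentence in corpus:
        for x, y in pairs:
            # Assumption: the relation is expressed by a short span between x and y.
            m = re.search(re.escape(x) + r"(.{1,40}?)" + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(corpus, patterns):
    """Relation -> pairs view: find new entity pairs linked by a learned context."""
    pairs = set()
    for sentence in corpus:
        for p in patterns:
            # Assumption: an entity is a capitalized word or a four-digit year.
            regex = r"([A-Z][a-z]+)" + re.escape(p) + r"([A-Z][a-z]+|\d{4})"
            for m in re.finditer(regex, sentence):
                pairs.add((m.group(1), m.group(2)))
    return pairs


def bootstrap(corpus, seeds, iterations=2):
    """Alternate between the two views, growing the set of pairs each iteration."""
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = learn_patterns(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs


result = bootstrap(corpus, seeds)
```

In this toy run, a single seed pair is enough to learn the linking context " was born in " and recover the other two pairs in the corpus. A real system would need some scoring or filtering of patterns and pairs to keep precision high; that step is omitted here.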