Unsupervised question answering data acquisition from local corpora
Authors: | Lucian Vlad Lita, Jaime G. Carbonell |
---|---|
Publication year: | 2004 |
Subject: | Training set; Point (typography); Computer science; Construct (python library); Ontology (information science); Machine learning; Task (project management); Data acquisition; Robustness (computer science); Question answering; Unsupervised learning; Artificial intelligence; Natural language processing |
Source: | CIKM |
DOI: | 10.1145/1031171.1031283 |
Description: | Data-driven approaches in question answering (QA) are increasingly common. Since the availability of training data for such approaches is very limited, we propose an unsupervised algorithm that generates high-quality question-answer pairs from local corpora. The algorithm is ontology independent and requires only very small seed data as its starting point. Two alternating views of the data make learning possible: 1) question types are viewed as relations between entities, and 2) question types are described by their corresponding question-answer pairs. These two aspects of the data allow us to construct an unsupervised algorithm that acquires high-precision question-answer pairs. We show the quality of the acquired data for different question types and perform a task-based evaluation. With each iteration, pairs acquired by the unsupervised algorithm are used as training data for a simple QA system. Performance increases with the number of question-answer pairs acquired, confirming the robustness of the unsupervised algorithm. We introduce the notion of semantic drift and show that it is a desirable quality in training data for question answering systems. |
Database: | OpenAIRE |
External link: |
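
The abstract describes learning that alternates between two views of the data: a question type as a relation between entities, and a question type as a set of question-answer pairs. The record does not give the algorithm's details, so the following is only a minimal sketch of a generic seed-based bootstrapping loop in that spirit, not the authors' method. The function names (`learn_patterns`, `apply_patterns`, `bootstrap`), the regular expressions, and the toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter


def learn_patterns(corpus, pairs, min_count=1):
    """From known (entity, answer) pairs, collect the text that links them.

    Hypothetical helper: the linking text stands in for the "relation" view.
    """
    counts = Counter()
    for sentence in corpus:
        for entity, answer in pairs:
            # Capture up to 40 characters between the entity and its answer.
            m = re.search(
                re.escape(entity) + r"(.{1,40}?)" + re.escape(answer), sentence
            )
            if m:
                counts[m.group(1)] += 1
    return {text for text, n in counts.items() if n >= min_count}


def apply_patterns(corpus, patterns):
    """Use the learned linking text to extract new (entity, answer) pairs.

    Assumption: an entity is one capitalized word and an answer is one word.
    """
    found = set()
    for sentence in corpus:
        for text in patterns:
            regex = r"([A-Z]\w+)" + re.escape(text) + r"(\w+)"
            for m in re.finditer(regex, sentence):
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(corpus, seeds, iterations=2):
    """Alternate between the two views, starting from a small seed set."""
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = learn_patterns(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs


if __name__ == "__main__":
    toy_corpus = [
        "Mozart was born in 1756.",
        "Beethoven was born in 1770.",
        "Einstein was born in 1879 in Ulm.",
    ]
    # A single seed pair for a hypothetical "birth year" question type.
    for pair in sorted(bootstrap(toy_corpus, {("Mozart", "1756")})):
        print(pair)
```

In this sketch each acquired pair could be turned into a training example for a QA system (for instance, "When was Beethoven born?" with answer "1770"); how the paper forms questions and filters pairs is not stated in this record.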