The TechQA Dataset

Authors: Saswati Dana, Saurabh Pujar, Radu Florian, Dinesh Khandelwal, Todd Ward, Rong Zhang, Rishav Chakravarti, Lin Pan, Rosario A. Uceda-Sosa, John F. Pitrelli, Salim Roukos, Martin Franz, Anthony Ferritto, Cezar Pendus, Avirup Sil, Mohamed Nasr, J. Scott McCarley, Dinesh Garg, Vittorio Castelli, Mike McCawley, Andrzej Sakrajda
Year of publication: 2020
Subject:
Source: ACL
DOI: 10.18653/v1/2020.acl-main.117
Description: We introduce TechQA, a domain-adaptation question answering dataset for the technical support domain. The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size (600 training, 310 dev, and 490 evaluation question/answer pairs), thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TechQA is meant to stimulate research in domain adaptation rather than to serve as a resource for building QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote, a technical document that addresses a specific technical issue. We also release a collection of the 801,998 publicly available Technotes as of April 4, 2019 as a companion resource that might be used for pretraining, to learn representations of the IT domain language.
Long version of conference paper to be submitted
Database: OpenAIRE