Efficient Corpus Creation Method for NLU Using Interview with Probing Questions

Author:	Hiroaki Kokubo, Jinhua She, Rintaro Ikeshita, Masataka Motohashi, Yasunari Obuchi, Takeshi Homma, Kazuaki Shima
Year of publication:	2019
Subject:	Computer science; Natural language understanding; 02 engineering and technology; Human-Computer Interaction; Artificial Intelligence; Morpheme; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Computer Vision and Pattern Recognition; Natural language processing
Source:	Journal of Advanced Computational Intelligence and Intelligent Informatics. 23:947-955
ISSN:	1883-8014 1343-0130
DOI:	10.20965/jaciii.2019.p0947
Description:	This paper presents an efficient method for building a corpus to train natural language understanding (NLU) modules. Conventional corpus creation methods follow a common cycle: a subject is given a specific situation in which the subject operates a device by voice, and the subject then speaks one utterance to execute the task. These methods require many subjects to build a large-scale corpus, which increases lead time and financial cost. To solve this problem, we propose incorporating a “probing question” into the cycle. Specifically, after a subject speaks one utterance, the subject is asked to think of alternative utterances that execute the same task. In this way, we obtain many utterances from a small number of subjects. An evaluation of the proposed method applied to interview-based corpus creation shows that it reduces the number of subjects by 41% while maintaining the morphological diversity of the corpus and its morphological coverage of user utterances spoken to commercial devices. It also shows that the proposed method reduces the total time for interviewing subjects by 36% compared with the conventional method. We conclude that the proposed method can be used to build a useful corpus while reducing lead time and financial cost.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::ed89fa1d69d7ec800e5ad004135311c2 https://doi.org/10.20965/jaciii.2019.p0947