Efficient Corpus Creation Method for NLU Using Interview with Probing Questions
Authors: | Hiroaki Kokubo, Jinhua She, Rintaro Ikeshita, Masataka Motohashi, Yasunari Obuchi, Takeshi Homma, Kazuaki Shima |
---|---|
Year of publication: | 2019 |
Subject: |
Computer science; Natural language understanding; Natural language processing; Artificial intelligence; Human-Computer Interaction; Morpheme; Computer Vision and Pattern Recognition; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020204 information systems; 020201 artificial intelligence & image processing |
Source: | Journal of Advanced Computational Intelligence and Intelligent Informatics. 23:947-955 |
ISSN: | 1883-8014 1343-0130 |
DOI: | 10.20965/jaciii.2019.p0947 |
Description: | This paper presents an efficient method to build a corpus to train natural language understanding (NLU) modules. Conventional corpus creation methods involve a common cycle: a subject is given a specific situation in which the subject operates a device by voice, and then the subject speaks one utterance to execute the task. In these methods, many subjects are required in order to build a large-scale corpus, which increases lead time and financial cost. To solve this problem, we propose to incorporate a “probing question” into the cycle. Specifically, after a subject speaks one utterance, the subject is asked to think of alternative utterances to execute the same task. In this way, we obtain many utterances from a small number of subjects. An evaluation of the proposed method applied to interview-based corpus creation shows that the proposed method reduces the number of subjects by 41% while maintaining morphological diversity in a corpus and morphological coverage for user utterances spoken to commercial devices. It also shows that the proposed method reduces the total time for interviewing subjects by 36% compared with the conventional method. We conclude that the proposed method can be used to build a useful corpus while reducing lead time and financial cost. |
Database: | OpenAIRE |