Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

Authors:	Sachin S. Kajarekar, Vikramjit Mitra, Panayiotis G. Georgiou, Jeffrey P. Bigham, Darren Botten, Sarah Wu, Lauren Tooley, Zifang Huang, Ashwini Palekar, Colin Lea, Shrinath Thelapurath
Language:	English
Year of publication:	2021
Subject:	FOS: Computer and information sciences; Sound (cs.SD); Computer Science - Machine Learning; Stuttering; Computer science; Computer Science - Artificial Intelligence; Speech recognition; Computer Vision and Pattern Recognition (cs.CV); Computer Science - Computer Vision and Pattern Recognition; Pronunciation; Computer Science - Sound; Machine Learning (cs.LG); Fluency; Speech recognition performance; Voice assistant; Audio and Speech Processing (eess.AS); medicine; FOS: Electrical engineering, electronic engineering, information engineering; Computer Science - Computation and Language; Focus (linguistics); Artificial Intelligence (cs.AI); medicine.symptom; Computation and Language (cs.CL); Decoding methods; Word (computer architecture); Electrical Engineering and Systems Science - Audio and Speech Processing
Description:	Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice-operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and, as a consequence, do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and on production-oriented approaches for improving performance for common voice assistant tasks (e.g., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors, resulting in intended speech Word Error Rates (isWER) that are 13.64% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system, one can improve isWER by 24% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6% better domain recognition and 1.7% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities. 5 pages, 1 page of references, 2 figures.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::c01e4ccd12224326af85ba77852e0b70 http://arxiv.org/abs/2106.11759
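
Illustrative note on the metrics in the description: the abstract reports errors as a word error rate scored against the intended speech, and distinguishes absolute from relative changes. The sketch below is not the authors' scoring tool. It assumes that isWER is an ordinary word-level edit-distance WER computed against the words the speaker intended to say, and the function names and the numbers in the usage lines are hypothetical, chosen only to show the arithmetic.

```python
def word_error_counts(reference, hypothesis):
    """Word-level Levenshtein alignment of a hypothesis against a reference.

    Assumption: `reference` holds the intended words (no dysfluencies),
    `hypothesis` holds the recognizer output.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (total_edits, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # match, no cost
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])

    total, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {"subs": subs, "dels": dels, "ins": ins, "wer": total / len(ref)}


def relative_improvement(baseline, tuned):
    """Fraction of the baseline error rate removed by tuning."""
    return (baseline - tuned) / baseline


# Hypothetical example: word repetitions surface as insertion errors.
result = word_error_counts("what is the weather", "what what is is the weather")

# Hypothetical error rates, not figures from the paper: a drop from 0.25
# to 0.19 is 0.06 absolute, which is 24% relative.
example_relative = relative_improvement(0.25, 0.19)
```

In this toy case the two repeated words count as two insertions against a four-word reference, so the error rate is 0.5 even though every intended word was recognized. The second helper shows why a relative figure such as 24% cannot be compared directly with an absolute figure such as 13.64 percentage points.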