Combining Acoustic Embeddings and Decoding Features for End-of-Utterance Detection in Real-Time Far-Field Speech Recognition Systems
Author: | Chengyuan Ma, Bjorn Hoffmeister, Roland Maas, Guitang Lan, Gautam Tiwari, Shaun N. Joseph, Kyle Goehner, Ariya Rastrow |
---|---|
Year of publication: | 2018 |
Subject: | Artificial neural network, Computer science, Speech recognition, Feature extraction, Feed forward, Latency (audio), Inference, Networking & telecommunications, Engineering and technology, Speech-language pathology & audiology, Medical and health sciences, Electrical engineering, electronic engineering, information engineering, Feature (machine learning), Utterance, Decoding methods |
Source: | ICASSP |
DOI: | 10.1109/icassp.2018.8461478 |
Description: | We present an end-of-utterance detector for real-time automatic speech recognition (ASR) in far-field scenarios. The proposed system consists of three components: a long short-term memory (LSTM) neural network trained on acoustic features, an LSTM trained on 1-best recognition hypotheses of the ASR decoder, and a feedforward deep neural network (DNN) that combines embeddings derived from both LSTMs with pause-duration features from the ASR decoder. At inference time, lower and upper latency (pause-duration) bounds act as safeguards. Within the latency bounds, the utterance end-point is triggered as soon as the DNN posterior reaches a tuned threshold. Our experimental evaluation is carried out on real recordings of natural human interactions with voice-controlled far-field devices. We show that the acoustic embeddings are the single most powerful feature and are particularly suitable for cross-lingual applications. We furthermore show the benefit of ASR decoder features, especially as a low-cost alternative to ASR hypothesis embeddings. |
Database: | OpenAIRE |
External link: |
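The end-pointing rule in the abstract (trigger as soon as the DNN posterior reaches a tuned threshold, with lower and upper pause-duration bounds as safeguards) can be sketched as follows. This is a minimal illustration only; the function name, threshold, and bound values are assumptions, not taken from the paper.

```python
def should_endpoint(posterior: float, pause_ms: float,
                    threshold: float = 0.8,        # tuned DNN posterior threshold (illustrative)
                    min_pause_ms: float = 200.0,   # lower latency bound (illustrative)
                    max_pause_ms: float = 1500.0   # upper latency bound (illustrative)
                    ) -> bool:
    """Decide whether to declare end-of-utterance at the current frame."""
    if pause_ms < min_pause_ms:
        # Lower safeguard: never end-point before the minimum pause duration.
        return False
    if pause_ms >= max_pause_ms:
        # Upper safeguard: force an end-point once the maximum pause is reached.
        return True
    # Within the latency bounds: trigger on the combined DNN posterior.
    return posterior >= threshold
```

Usage: call once per decoding frame with the current DNN posterior and the pause duration reported by the ASR decoder; the first `True` return marks the utterance end-point.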