Training deep neural networks with non-uniform frame-level cost function for automatic speech recognition

Authors:	Aldonso Becerra, J. Ismael de la Rosa, Efrén González, A. David Pedroza, N. Iracemi Escalante
Year of publication:	2018
Subject:	Artificial neural network; Computer Networks and Communications; Computer science; Speech recognition; Posterior probability; Frame (networking); Word error rate; Engineering and technology; Ambiguity; Speech-language pathology & audiology; Medical and health sciences; Hardware and Architecture; Electrical engineering, electronic engineering, information engineering; Media Technology; Artificial intelligence & image processing; Other medical science; Function (engineering); Software; Word (computer architecture)
Source:	Multimedia Tools and Applications. 77:27231-27267
ISSN:	1573-7721 1380-7501
DOI:	10.1007/s11042-018-5917-5
Description:	This paper presents two new variants of the frame-level cost function for training deep neural networks, with the goal of lowering word error rates in speech recognition. Optimization methods and the functions they minimize are fundamental considerations when working with neural networks, so improving them is a prominent research objective, which this paper partly addresses. The first proposal is based on the concept of extropy, the complementary dual function of an uncertainty measure. The conventional cross-entropy function can be mapped to a non-uniform loss function based on its corresponding extropy, which emphasizes frames whose assignment to specific senones is ambiguous. The second proposal fuses this mapped cross-entropy function with the idea of boosted cross-entropy, which emphasizes frames with low target posterior probability. The proposed approaches were evaluated on a custom mid-vocabulary, speaker-independent voice corpus. This dataset is used for recognizing digit strings and personal name lists in Spanish from the northern central part of Mexico on a connected-words phone dialing task. Compared with the conventional, well-established cross-entropy objective function, the two proposed approaches obtain relative word error rate improvements of $12.3\%$ and $10.7\%$, respectively.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::6f27fa2e957637a8950740cfef627d87 https://doi.org/10.1007/s11042-018-5917-5