Popis: |
Abstract: Traditional English corpora mostly collect information from a single modality and lack multimodal information, which lowers corpus quality and limits recognition accuracy. To address these problems, this paper introduces depth information into multimodal corpora and studies both a construction method for an English multimodal corpus that integrates electronic images with depth information and a speech recognition method for that corpus. The adopted multimodal fusion strategy combines speech signals with image information, including key visual cues such as the speaker's lip movements and facial expressions, and uses deep learning to extract acoustic and visual features. Experiments were conducted with the acoustic models in the Kaldi toolkit and led to the following conclusions. With 15-dimensional lip features at an SNR (signal-to-noise ratio) of 10 dB, the recognition accuracy of corpus A was 2.4% higher than that of corpus B under the monophone model, and 1.7% higher under the triphone model. With 32-dimensional lip features at an SNR of 10 dB, the accuracy of corpus A was 1.4% higher than that of corpus B under the monophone model, and 2.6% higher under the triphone model. The English multimodal corpus with image and depth information thus achieves high accuracy, and the depth information helps improve the accuracy of the corpus.