Multimodal speaker/speech recognition using lip motion, lip texture and audio

Author:	A.M. Tekalp, Engin Erzin, Hasan Ertan Cetingul, Yücel Yemez
Publication year:	2006
Subject:	Motion analysis; Computer science; Speech recognition; education; ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION; GeneralLiterature_MISCELLANEOUS; stomatognathic system; Minimum bounding box; ComputerApplications_MISCELLANEOUS; Motion estimation; otorhinolaryngologic diseases; Discrete cosine transform; Computer vision; Electrical and Electronic Engineering; ComputingMethodologies_COMPUTERGRAPHICS; Modality (human–computer interaction); business.industry; Speaker recognition; Speech processing; stomatognathic diseases; ComputingMethodologies_PATTERNRECOGNITION; Control and Systems Engineering; Signal Processing; Word recognition; Computer Vision and Pattern Recognition; Artificial intelligence; Mel-frequency cepstrum; business; psychological phenomena and processes; Software
Source:	Signal Processing. 86:3549-3558
ISSN:	0165-1684
Description:	We present a new multimodal speaker/speech recognition system that integrates audio, lip texture and lip motion modalities. Fusion of audio and face texture modalities has been investigated in the literature before. The emphasis of this work is to investigate the benefits of inclusion of lip motion modality for two distinct cases: speaker and speech recognition. The audio modality is represented by the well-known mel-frequency cepstral coefficients (MFCC) along with the first and second derivatives, whereas lip texture modality is represented by the 2D-DCT coefficients of the luminance component within a bounding box about the lip region. In this paper, we employ a new lip motion modality representation based on discriminative analysis of the dense motion vectors within the same bounding box for speaker/speech recognition. The fusion of audio, lip texture and lip motion modalities is performed by the so-called reliability weighted summation (RWS) decision rule. Experimental results show that inclusion of lip motion modality provides further performance gains over those which are obtained by fusion of audio and lip texture alone, in both speaker identification and isolated word recognition scenarios.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::e864b6349d96c62bcacd2d07b588c315 https://doi.org/10.1016/j.sigpro.2006.02.045
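
Illustrative note: the abstract names two building blocks that are easy to sketch, 2D-DCT coefficients of the luminance inside a lip bounding box and a reliability-weighted sum of per-modality scores. The record does not give the paper's exact formulas, so the following is only a minimal sketch under stated assumptions: the DCT is the standard orthonormal DCT-II, the fusion is a plain normalized weighted sum, and all names (`dct2`, `weighted_sum_fusion`, the score dictionaries and the reliability values) are hypothetical and not taken from the paper.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a rectangular block (list of rows).

    Assumption: `block` stands in for luminance values cropped from a
    lip bounding box; the paper's actual box size is not given here.
    """
    rows, cols = len(block), len(block[0])

    def scale(k, n):
        # Orthonormal scaling factors for DCT-II.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0.0
            for y in range(rows):
                for x in range(cols):
                    total += (
                        block[y][x]
                        * math.cos(math.pi * (2 * y + 1) * u / (2 * rows))
                        * math.cos(math.pi * (2 * x + 1) * v / (2 * cols))
                    )
            out[u][v] = scale(u, rows) * scale(v, cols) * total
    return out


def weighted_sum_fusion(scores_by_modality, reliabilities):
    """Fuse per-class scores with reliability weights normalized to sum to 1.

    Assumption: a generic weighted sum, not the paper's exact RWS rule.
    """
    total = sum(reliabilities.values())
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive value")
    classes = list(next(iter(scores_by_modality.values())).keys())
    fused = {c: 0.0 for c in classes}
    for modality, scores in scores_by_modality.items():
        weight = reliabilities[modality] / total
        for c in classes:
            fused[c] += weight * scores[c]
    return fused


# Hypothetical per-class scores for two candidate speakers "a" and "b".
scores = {
    "audio": {"a": 0.2, "b": 0.8},
    "lip_texture": {"a": 0.6, "b": 0.4},
    "lip_motion": {"a": 0.7, "b": 0.3},
}
# Hypothetical reliability values; in practice these would be estimated.
reliabilities = {"audio": 0.5, "lip_texture": 0.3, "lip_motion": 0.2}

fused = weighted_sum_fusion(scores, reliabilities)
decision = max(fused, key=fused.get)
```

In this sketch the decision is simply the class with the largest fused score. For real images, a library routine such as a fast DCT would replace the quadruple loop, which is written out here only to keep the example dependency-free.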