Popis: |
In this thesis, a number of important issues relating to the use of both audio and video information for speech and speaker recognition are investigated. A comprehensive comparison of different visual feature types is given, including both geometric and image transformation based features. A new geometric based method for feature extraction is described, as well as the novel use of curvelet based features. Different methods for constructing the feature vectors are compared, as well as feature vector sizes and the use of dynamic features. Each feature type is tested against three types of visual noise: compression, blurring and jitter. A novel method of integrating the audio and video information streams called the maximum stream posterior (MSP) is described. This method is tested in both speaker dependent and speaker independent audio-visual speech recognition (AVSR) systems, and is shown to be robust to noise in either the audio or video streams, given no prior knowledge of the noise. This method is then extended to form the maximum weighted stream posterior (MWSP) method. Finally, both the MSP and MWSP are tested in an audio-visual speaker recognition system (AVSpR). / Experiments using the XM2VTS database will show that both of these methods can outperform, _.','/ standard methods in terms of recognition accuracy in situations where either stream is corrupted. |