Description: |
In building smart cities, there is a need for voice biometrics, through which a person's identity can be validated over a wireless medium such as a mobile phone. Voice biometrics has an advantage over other biometrics, such as retinal scans and fingerprints, in that it does not require the physical presence of the person. Speaker recognition (SR) is the process of identifying a person from his or her unique voice. A robust SR system has to cope with background noise, channel effects, and the health and emotional state of the speaker. The present study focuses on SR under emotional conditions, examining the performance of the Gaussian Mixture Model-Universal Background Model (GMM-UBM) approach on the Amritha emotional database. The emotions considered are neutral, anger, happiness, and sadness. The results show that the performance of an SR system trained with clean speech degrades when it is tested with emotional data, while accuracies are high when training and testing are done on the same emotion. The reasons for this degradation are analyzed using parameters such as strength of excitation, energy of excitation, fundamental frequency (F0), and duration. Further analysis measures the distance between the GMMs with the Kullback-Leibler (KL) divergence. Several experiments highlight the importance of including emotional data in training. Finally, a two-stage emotional SR system is proposed to improve recognition accuracy: in the first stage, the emotional state of the speaker is detected by an emotion recognition system, and in the second stage the utterance is validated against the speaker models of the recognized emotion.
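
For readers unfamiliar with the baseline, the following Python sketch shows how a GMM-UBM system scores a test utterance via a log-likelihood ratio. The random arrays stand in for MFCC feature frames, and the model sizes and threshold are illustrative assumptions, not the study's configuration.

import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative GMM-UBM verification sketch. In the standard recipe the
# speaker model is MAP-adapted from the UBM; here it is fitted directly
# only to keep the sketch short.
rng = np.random.default_rng(0)
background_frames = rng.normal(size=(2000, 13))        # pooled background speech
enroll_frames = rng.normal(0.5, 1.0, size=(400, 13))   # target speaker's enrollment data
test_frames = rng.normal(0.5, 1.0, size=(200, 13))     # test utterance

ubm = GaussianMixture(n_components=16, covariance_type="diag",
                      max_iter=200, random_state=0).fit(background_frames)
speaker = GaussianMixture(n_components=16, covariance_type="diag",
                          max_iter=200, random_state=0).fit(enroll_frames)

# score() returns the mean per-frame log-likelihood, so the difference is
# the average log-likelihood ratio; the claimed identity is accepted when
# it exceeds a decision threshold (0.0 here is purely illustrative).
llr = speaker.score(test_frames) - ubm.score(test_frames)
print("accept" if llr > 0.0 else "reject")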
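The KL divergence between two GMMs has no closed form, so it is commonly estimated by Monte Carlo sampling; the study may use a different approximation. A minimal sketch of the Monte Carlo estimate follows, with synthetic features standing in for, e.g., neutral and anger frames.

import numpy as np
from sklearn.mixture import GaussianMixture

def kl_divergence_mc(gmm_p, gmm_q, n_samples=10000):
    # KL(p || q) = E_p[log p(x) - log q(x)]; the expectation is estimated
    # by sampling from p and averaging the log-density difference.
    samples, _ = gmm_p.sample(n_samples)
    return float(np.mean(gmm_p.score_samples(samples) - gmm_q.score_samples(samples)))

# Usage with synthetic stand-ins for two emotion-dependent models:
rng = np.random.default_rng(1)
neutral = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(rng.normal(size=(1000, 13)))
anger = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(rng.normal(0.8, 1.2, size=(1000, 13)))
print(kl_divergence_mc(neutral, anger))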
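The proposed two-stage system can be summarized in code as follows; the function name and its inputs (dictionaries of fitted GaussianMixture models keyed by emotion and by speaker) are hypothetical stand-ins for the recognized-emotion routing described above.

def two_stage_recognize(frames, emotion_gmms, speaker_gmms_by_emotion):
    # Stage 1: classify the utterance's emotion as the emotion-dependent
    # GMM with the highest average log-likelihood.
    emotion = max(emotion_gmms, key=lambda e: emotion_gmms[e].score(frames))
    # Stage 2: validate the utterance against the speaker models trained
    # on the recognized emotion only.
    models = speaker_gmms_by_emotion[emotion]
    speaker = max(models, key=lambda s: models[s].score(frames))
    return emotion, speaker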