Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System

Authors: Javier Hernando, Jordi Luque, Abraham Woubie Zewoudie
Contributors: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Year of publication: 2016
Subject:
Computer science
Speech recognition
Reconeixement automàtic de la parla
Context (language use)
02 engineering and technology
Viterbi algorithm
01 natural sciences
0103 physical sciences
0202 electrical engineering
electronic engineering
information engineering
Prosody
Cluster analysis
Hidden Markov model
010301 acoustics
Automatic speech recognition
Speaker error reduction
Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
020206 networking & telecommunications
Pattern recognition
Speaker recognition
Speaker diarisation
Computer Science::Sound
Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC]
Mel-frequency cepstrum
Artificial intelligence
i-vectors
Source: Odyssey 2016
Odyssey
UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
Recercat. Dipósit de la Recerca de Catalunya
ISSN: 2312-2846
DOI: 10.21437/Odyssey.2016-58
Description: In recent years, i-vectors have been successfully applied to speaker recognition tasks. This work assesses the suitability of i-vector modeling for the speaker diarization task. In this context, a weighted cosine distance between two different sets of i-vectors is proposed for speaker clustering. Speech clusters generated by Viterbi segmentation are first modeled by two different i-vectors. While the first i-vector represents the distribution of the commonly used short-term Mel Frequency Cepstral Coefficients, the second one represents a selection of voice quality and prosodic features. To combine short- and long-term speech statistics, the cosine-distance scores of the two i-vectors are linearly weighted to obtain a single similarity score. The fused score is then used as the speaker clustering distance. Experimental results on two different evaluation sets of the Augmented Multi-party Interaction corpus show the suitability of combining both sources of information within the i-vector space. They also show that i-vector based clustering provides a significant improvement in diarization error rate over clustering based on Gaussian Mixture Modeling. Furthermore, this work reports a significant speaker error reduction when short-term based i-vector clustering is augmented with a second i-vector estimated from voice quality and prosody related speech features.
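The linear score fusion described in the abstract can be illustrated with a minimal sketch. It assumes each cluster is already summarized by two i-vectors (one from short-term MFCCs, one from voice quality and prosodic features); the function names, the `weight` parameter, and its default value are hypothetical and are not taken from the paper, which does not state its weights in this record.

```python
import numpy as np


def cosine_score(a, b):
    """Cosine similarity between two i-vectors (1-D arrays)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def fused_score(mfcc_a, mfcc_b, pros_a, pros_b, weight=0.8):
    """Linearly weighted combination of two cosine scores.

    `weight` (assumed value, for illustration only) is applied to the
    short-term MFCC i-vector score; `1 - weight` is applied to the score of
    the voice quality / prosody i-vectors.
    """
    short_term = cosine_score(mfcc_a, mfcc_b)
    long_term = cosine_score(pros_a, pros_b)
    return weight * short_term + (1.0 - weight) * long_term


# Toy usage: two clusters with identical MFCC i-vectors but orthogonal
# prosodic i-vectors, so only the short-term term contributes.
score = fused_score([1.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.0, 1.0], weight=0.75)
```

In a clustering loop, the pair of clusters with the highest fused score would be the most plausible candidates for merging; how the paper selects the weight and the stopping criterion is not specified in this record.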
Database: OpenAIRE