Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System
Author: | Javier Hernando, Jordi Luque, Abraham Woubie Zewoudie |
---|---|
Contributors: | Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions; Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla |
Year of publication: | 2016 |
Subject: | Computer science; Speech recognition; Automatic speech recognition (Reconeixement automàtic de la parla); Context (language use); Engineering and technology; Natural sciences; Physical sciences; Electrical engineering, electronic engineering, information engineering; Acoustics; Networking & telecommunications; Viterbi algorithm; Prosody; Cluster analysis; Hidden Markov model; Speaker error reduction; Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing); Computer Science::Sound; Pattern recognition; Speaker recognition; Speaker diarisation; Mel-frequency cepstrum; Artificial intelligence; i-vectors; Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC] |
Source: | Odyssey 2016; UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC); Recercat. Dipósit de la Recerca de Catalunya |
ISSN: | 2312-2846 |
DOI: | 10.21437/Odyssey.2016-58 |
Description: | i-vectors have been successfully applied to speaker recognition tasks in recent years. This work assesses the suitability of i-vector modeling for the speaker diarization task. In this context, a weighted cosine distance between two different sets of i-vectors is proposed for speaker clustering. Speech clusters generated by Viterbi segmentation are first modeled by two different i-vectors. The first i-vector represents the distribution of the commonly used short-term Mel Frequency Cepstral Coefficients, while the second represents a selection of voice quality and prosodic features. To combine short- and long-term speech statistics, the cosine-distance scores of the two i-vectors are linearly weighted to obtain a single similarity score. The fused score is then used as the speaker clustering distance. Experimental results on two different evaluation sets of the Augmented Multi-party Interaction corpus show the suitability of combining both sources of information within the i-vector space. They also show that i-vector based clustering provides a significant improvement in diarization error rate over clustering based on Gaussian mixture modeling. Furthermore, this work reports a significant speaker error reduction when short-term i-vector clustering is augmented with a second i-vector estimated from voice quality and prosody related speech features. |
Database: | OpenAIRE |
External link: |
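
The score fusion described in the abstract, where the cosine scores of the short-term and the long-term i-vector are linearly weighted into a single similarity, might look roughly like the following sketch. The record does not give the weighting formula or the weight value, so the convex combination, the parameter name `alpha`, its default of 0.5, and the function names are all assumptions made for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def fused_score(
    short_a: Sequence[float],
    short_b: Sequence[float],
    long_a: Sequence[float],
    long_b: Sequence[float],
    alpha: float = 0.5,  # hypothetical weight; not taken from the paper
) -> float:
    """Linearly weight two cosine scores into one cluster similarity.

    short_a / short_b: i-vectors of clusters A and B from short-term
    (MFCC) features.
    long_a / long_b: i-vectors of the same clusters from voice quality
    and prosodic features.
    Assumed form: alpha * short_score + (1 - alpha) * long_score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    short_score = cosine_similarity(short_a, short_b)
    long_score = cosine_similarity(long_a, long_b)
    return alpha * short_score + (1.0 - alpha) * long_score


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real i-vectors have far more dimensions.
    print(fused_score([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], alpha=0.75))
```

In a clustering loop, the pair of clusters with the highest fused score would be the natural merge candidate. With `alpha` set to 1.0 the sketch reduces to clustering on the short-term i-vector alone.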