HiLAM-state discriminative multi-task deep neural network in dynamic time warping framework for text-dependent speaker verification
Author: | Mohammad Azharuddin Laskar, Rabul Hussain Laskar |
---|---|
Publication year: | 2020 |
Subject: |
Linguistics and Language; Dynamic time warping; Artificial neural network; Computer science; Communication; Speech recognition; Acoustic model; Word error rate; Networking & telecommunications; Context (language use); Engineering and technology; Mixture model; Natural sciences; Language and Linguistics; Computer Science Applications; Discriminative model; Modeling and Simulation; Physical sciences; Electrical engineering, electronic engineering, information engineering; Computer Vision and Pattern Recognition; Hidden Markov model; Acoustics; Software |
Source: | Speech Communication. 121:29-43 |
ISSN: | 0167-6393 |
DOI: | 10.1016/j.specom.2020.03.007 |
Description: | This paper builds on a multi-task Deep Neural Network (DNN), which provides an utterance-level feature representation called the j-vector, to implement a Text-dependent Speaker Verification (TDSV) system. This technique exploits the speaker idiosyncrasies associated with individual pass-phrases. However, speaker information is known to be characteristic of more specific speech units, so important speaker identity traits are likely to be averaged out if speaker information is treated as a coarse entity spread uniformly across the whole pass-phrase. This work attempts to overcome this limitation and devises a technique to leverage the finer speaker traits. It proposes to align the training data for the multi-task DNN using the Hierarchical Multi-Layer Acoustic Model (HiLAM). HiLAM is an HMM-based text-dependent model that defines refined segments of a pass-phrase using Gaussian Mixture Model (GMM) states. This helps to exploit the speaker idiosyncrasies associated with finer and more specific segments of speech. Also, because HiLAM is built using the particular text in question, this alignment technique automatically accounts for the exact context of the speech units in the pass-phrase concerned. The proposed technique has been found to improve the performance of the system significantly. Integrating Dynamic Time Warping (DTW) with this technique leads to further improvement. Experiments were conducted on Part 1 of the RSR2015, RedDots, and NITS-TD databases. The best-performing proposed system achieves a relative Equal Error Rate (EER) reduction of up to 50.98% with respect to the baseline j-vector-based system for the overall test condition on the RSR2015 database. |
Database: | OpenAIRE |
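
The description above mentions integrating Dynamic Time Warping (DTW) into the verification system but does not say how. For orientation only, the following is a minimal sketch of textbook DTW between two sequences of feature vectors. It is not the authors' implementation: the function name `dtw_distance`, the Euclidean local distance, and the unconstrained step pattern are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Accumulated cost of the best DTW alignment between sequences a and b.

    a, b: non-empty sequences of equal-dimension feature vectors (tuples or lists).
    Local distance is Euclidean (an assumption; the record does not specify one).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local frame-to-frame distance
            # Allowed steps: advance in a only, in b only, or in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical one-dimensional "utterances": the second repeats a frame
enrol = [(0.0,), (1.0,), (2.0,)]
test = [(0.0,), (1.0,), (1.0,), (2.0,)]
score = dtw_distance(enrol, test)
```

Because DTW lets one frame align with several frames of the other sequence, the repeated frame in `test` adds no cost here. In a verification setting a lower accumulated cost would indicate a closer match between the enrolment and test sequences.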