Abstract: |
Traditional speaker recognition methods often lose speaker information and suffer from reduced recognition rates. To address these problems, this paper proposes an Fca-Res2Net speaker recognition model that incorporates a self-attention mechanism. First, the model takes modified mel-frequency cepstral coefficients (MFCCs) as the system feature input and combines inverse mel-frequency cepstral coefficients (IMFCCs) with the MFCCs as the base input features, which yields more representative speech spectral features. On this basis, the difference parameters ΔMFCC and ΔIMFCC are fused in, so that both dynamic and static speech features are extracted across the high- and low-frequency bands. Second, frequency channel attention networks (FcaNets) are introduced on top of the baseline model (Res2Net, a multiscale backbone architecture), and the residual module fuses shallow and deep speaker features so that the weights of the different feature channels are learned more effectively without increasing the number of parameters. In addition, a self-attention mechanism is integrated to bring in temporal information and strengthen the modelling of long-span speech features. Finally, the classification output yields the recognition result. Experiments on the VoxCeleb dataset, which provides sufficient data volume, show that the proposed model improves the recognition rate and robustness for speakers in long speech compared with current mainstream speaker recognition methods. |
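The feature fusion described in the abstract can be illustrated with a minimal sketch. It assumes the MFCC and IMFCC matrices have already been computed elsewhere and have shape (frames, coefficients); the abstract does not say how the "modified" MFCCs are obtained, so that step is left out. The difference parameters are approximated here with the standard regression-based delta formula, which may differ from the one the authors use. The names `delta`, `fuse_features`, and `width` are hypothetical and chosen only for illustration.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along time; `features` has shape (frames, coeffs).

    Assumption: standard formula
        d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)
    with edge frames repeated at the boundaries.
    """
    num_frames = features.shape[0]
    # Repeat the first and last frame so every frame has `width` neighbours.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def fuse_features(mfcc: np.ndarray, imfcc: np.ndarray, width: int = 2) -> np.ndarray:
    """Stack static (MFCC, IMFCC) and dynamic (delta) features per frame."""
    return np.concatenate(
        [mfcc, imfcc, delta(mfcc, width), delta(imfcc, width)], axis=1
    )


if __name__ == "__main__":
    # Placeholder inputs standing in for real MFCC / IMFCC matrices.
    rng = np.random.default_rng(0)
    mfcc = rng.standard_normal((100, 13))
    imfcc = rng.standard_normal((100, 13))
    print(fuse_features(mfcc, imfcc).shape)
```

Concatenating along the coefficient axis keeps one fused vector per frame, so the result can be fed to a frame-level network such as the one the abstract describes; how the authors actually arrange the fused features is not stated in the abstract.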