Self-attention encoding and pooling for speaker recognition

Authors: Pooyan Safari, Javier Hernando, Miquel India
Contributors: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
Publication year: 2020
Subjects:
FOS: Computer and information sciences
Computer Science - Machine Learning
0209 industrial biotechnology
Sound (cs.SD)
Computer science
Self-attention encoding
Speech recognition
Pooling
02 engineering and technology
Computer Science - Sound
Machine Learning (cs.LG)
Reduction (complexity)
020901 industrial engineering & automation
Discriminative model
Natural language processing (Computer science)
Audio and Speech Processing (eess.AS)
Encoding (memory)
0202 electrical engineering, electronic engineering, information engineering
FOS: Electrical engineering, electronic engineering, information engineering
Tractament del llenguatge natural (Informàtica)
Representation (mathematics)
Transformer (machine learning model)
Speaker recognition
Speaker embedding
Self-attention pooling
Speaker verification
Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC]
020201 artificial intelligence & image processing
Mobile device
Electrical Engineering and Systems Science - Audio and Speech Processing
Source: UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
INTERSPEECH
DOI: 10.48550/arxiv.2008.01077
Description: The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate the design of more efficient deep models. At the same time, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and performs competitively with some other convolution-based benchmarks, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient in extracting time-invariant features from speaker utterances. This work was supported in part by the Spanish Project DeepVoice (TEC2015-69266-P).
Database: OpenAIRE