Author:
Manal Abdullah Alohali, Nasir Saleem, Delel Rhouma, Mohamed Medani, Hela Elmannai, Sami Bourouis
Language:
English
Year of publication:
2024
Source:
IEEE Access, Vol. 12, pp. 146513-146526 (2024)
Document type:
article
ISSN:
2169-3536
DOI:
10.1109/ACCESS.2024.3444596
Description:
Speech enhancement (SE) aims to improve the quality and intelligibility of speech signals, particularly in the presence of noise or other distortions, to ensure reliable communication and robust speech recognition. Deep neural networks (DNNs) have shown remarkable success in SE due to their ability to learn complex patterns and representations from large amounts of data. However, they face limitations in handling long-term temporal sequences. Spiking neural networks and transformers inherently manage temporal data and capture fine-grained temporal patterns in speech signals. This paper proposes a model that integrates self-attention with spiking neural networks for speech enhancement. The proposed model employs a convolutional encoder-decoder architecture with a spiking transformer acting as a bottleneck network. The spiking self-attention mechanism in this framework represents features using spike-based queries, keys, and values, enhancing them by effectively capturing temporal dependencies and contextual relationships in speech signals. The spiking transformer is split into two branches to capture comprehensive global dependencies across the temporal and spectral dimensions. The encoder-decoder incorporates a multi-scale feature extractor that builds a comprehensive hierarchical representation, significantly improving the model's ability to learn and process noisy speech. Experiments are conducted on two publicly available benchmark datasets, WSJ0-SI84 and VCTK+DEMAND. The proposed model achieves notable improvements of 33.69% in ESTOI, 1.05 in PESQ, and 11.36 dB in SDR over the noisy mixtures.
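The abstract describes a spiking self-attention mechanism in which queries, keys, and values are represented as spike trains. The minimal PyTorch sketch below illustrates one common way such a layer can be written; the layer sizes, the hard-threshold spike function, the scaling constant, and the omission of softmax are illustrative assumptions and do not reproduce the paper's exact formulation.

# Minimal sketch of a spiking self-attention block (PyTorch).
# Illustrative approximation only: layer sizes, the hard-threshold spike
# function, the scale factor, and the dropped softmax (common in
# spike-based attention) are assumptions, not the paper's exact model.
import torch
import torch.nn as nn


def spike(x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Binarize activations into spikes with a hard threshold.

    A trainable spiking network would use a surrogate gradient here;
    this sketch only shows the forward pass.
    """
    return (x > threshold).float()


class SpikingSelfAttention(nn.Module):
    """Self-attention where queries, keys, and values are spike trains."""

    def __init__(self, dim: int, scale: float = 0.125):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.scale = scale  # keeps spike-count products in a usable range

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_frames, dim), e.g. encoded spectrogram frames
        q = spike(self.q_proj(x))
        k = spike(self.k_proj(x))
        v = spike(self.v_proj(x))
        # Products of binary Q/K count co-active features, so this
        # simplified variant applies no softmax.
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (batch, T, T)
        out = attn @ v                                  # (batch, T, dim)
        return self.out_proj(out)


if __name__ == "__main__":
    frames = torch.randn(2, 100, 64)   # dummy batch of encoded frames
    block = SpikingSelfAttention(dim=64)
    print(block(frames).shape)          # torch.Size([2, 100, 64])

In the paper's architecture such a block would sit in the bottleneck between the convolutional encoder and decoder, with two branches attending over the temporal and spectral dimensions respectively; the sketch shows only a single branch.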
Database:
Directory of Open Access Journals