Speech enhancement by LSTM-based noise suppression followed by CNN-based speech restoration
Author: | Bruno Defraene, Tim Fingscheidt, Wouter Tirry, Maximilian Strake, Kristoff Fluyt |
Language: | English |
Year of publication: | 2020 |
Subject: | Computer science; Speech recognition; Speech enhancement; Noise suppression; Intelligibility (communication); Convolutional neural network; Long short-term memory; Artificial neural network; Speech restoration; Deep learning; Two-stage processing; Artificial intelligence; PESQ |
Source: | EURASIP Journal on Advances in Signal Processing, Vol. 2020, Iss. 1, Art. 49, pp. 1-26 (2020). https://doi.org/10.1186/s13634-020-00707-1 |
ISSN: | 1687-6180 |
Description: | Single-channel speech enhancement in highly non-stationary noise conditions is a very challenging task, especially when interfering speech is included in the noise. Deep learning-based approaches have notably improved the performance of speech enhancement algorithms under such conditions, but still introduce speech distortions when strong noise suppression is to be achieved. We propose to address this problem by using a two-stage approach, first performing noise suppression and subsequently restoring natural-sounding speech, using specifically chosen neural network topologies and loss functions for each task. A mask-based long short-term memory (LSTM) network is employed for noise suppression, and speech restoration is performed via spectral mapping with a convolutional encoder-decoder network (CED). The proposed method improves speech quality (PESQ) over state-of-the-art single-stage methods by about 0.1 points for unseen highly non-stationary noise types including interfering speech. Furthermore, it is able to increase intelligibility in low-SNR conditions and consistently outperforms all reference methods. |
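The two-stage signal flow described in the abstract (mask-based suppression on the magnitude spectrum, followed by spectral mapping for restoration) can be sketched as below. This is a minimal illustration, not the authors' method: the LSTM mask estimator and the convolutional encoder-decoder are replaced by placeholder functions, and the framing parameters (`n_fft`, `hop`) are assumed values.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    # Frame the signal with a Hann window and take the magnitude spectrum.
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def noise_suppression(noisy_mag):
    # Stage 1 (placeholder): in the paper, a mask-based LSTM predicts a
    # per-bin gain in [0, 1]; here a fixed Wiener-like gain stands in.
    mask = noisy_mag / (noisy_mag + noisy_mag.mean())
    return mask * noisy_mag

def speech_restoration(suppressed_mag):
    # Stage 2 (placeholder): in the paper, a convolutional encoder-decoder
    # maps the suppressed spectrum to a restored one; identity mapping here.
    return suppressed_mag

noisy = np.random.default_rng(0).standard_normal(4000)
mag = stft_mag(noisy)
restored = speech_restoration(noise_suppression(mag))
assert restored.shape == mag.shape  # both stages preserve the spectrogram shape
```

Because the stage-1 gain never exceeds one, the suppressed spectrum is attenuated bin-by-bin; the point of stage 2 in the paper is precisely to undo the speech distortion such attenuation causes.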
Database: | OpenAIRE |
External link: |