Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation

Author:	Paul Magron, Konstantinos Drossos, Stylianos Ioannis Mimilakis, Tuomas Virtanen
Contributors:	Magron, Paul, Tampere University of Technology [Tampere] (TUT), Fraunhofer Institute for Digital Media Technology [Ilmenau] (Fraunhofer IDMT), Fraunhofer (Fraunhofer-Gesellschaft)
Year of publication:	2018
Subject:	MaD TwinNet; Computer science; Speech recognition; Phase (waves); 02 engineering and technology; Monaural; Interference (wave propagation); Monaural singing voice separation; 030507 speech-language pathology & audiology; 03 medical and health sciences; 0202 electrical engineering, electronic engineering, information engineering; Redundancy (engineering); Wiener filtering; [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing; phase recovery; Wiener filter; Short-time Fourier transform; 020206 networking & telecommunications; Sinusoidal model; Fourier transform; deep neural networks; Singing; 0305 other medical science
Source:	Interspeech 2018, Sep 2018, Hyderabad, India; Tampere University
DOI:	10.21437/interspeech.2018-1845
Description:	State-of-the-art methods for monaural singing voice separation estimate the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is then synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music recordings show that these algorithms reduce interference in the estimated voice compared to the baseline approach.
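The baseline described in the abstract (estimated voice magnitude combined with the mixture's phase) can be illustrated with a minimal sketch. This is not the authors' code: the function name `baseline_estimate` and the two toy time-frequency bins are hypothetical, and the true voice magnitude stands in for a DNN output. It only shows why the mixture's phase is a poor stand-in for the voice phase where sources overlap.

```python
import numpy as np

def baseline_estimate(voice_mag, mixture_stft):
    """Combine a voice magnitude estimate with the mixture's phase.

    Hypothetical helper: voice_mag would come from a DNN in practice.
    """
    mixture_phase = np.angle(mixture_stft)
    return voice_mag * np.exp(1j * mixture_phase)

# Toy STFT coefficients for two time-frequency bins (assumed values).
voice = np.array([1.0 + 0.0j, 1.0 + 0.0j])
accomp = np.array([0.0 + 0.0j, 0.0 + 1.0j])  # silent in bin 0, overlaps the voice in bin 1
mixture = voice + accomp

# Use the true magnitude as an idealized magnitude estimate.
estimate = baseline_estimate(np.abs(voice), mixture)

# Phase difference between the baseline estimate and the true voice, per bin.
phase_error = np.angle(estimate * np.conj(voice))
print(phase_error)
```

Even with a perfect magnitude, the bin where the accompaniment overlaps the voice keeps a nonzero phase error, while the non-overlapping bin is recovered exactly. Phase recovery algorithms aim to reduce this kind of error.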
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3bce260d1bfd9602802dc6778a485ba4 https://doi.org/10.21437/interspeech.2018-1845