U-NET: A Supervised Approach for Monaural Source Separation.

Authors: Basir, Samiul, Hossain, Md. Nahid, Hosen, Md. Shakhawat, Ali, Md. Sadek, Riaz, Zainab, Islam, Md. Shohidul
Subject:
Source: Arabian Journal for Science & Engineering (Springer Science & Business Media B.V.); Sep 2024, Vol. 49 Issue 9, p12679-12691, 13p
Abstract: Separating speech is a challenging area of research, especially when the desired source must be recovered from a mixture. Deep learning has emerged as a promising solution, surpassing traditional methods. While prior research has mainly focused on the magnitude, the log-magnitude, or a combination of the magnitude and phase, a new approach using the Short-time Fourier Transform (STFT) and a deep Convolutional Neural Network named U-NET has been proposed. Unlike other methods, it considers both the real and imaginary components for decomposition. During the training stage, the mixed time-domain signal is transformed into a frequency-domain signal using the STFT, producing a mixed complex spectrogram. The spectrogram's real and imaginary parts are then separated and concatenated into a single matrix. The newly formed matrix is fed through U-NET to extract the source components. The same process is repeated at testing. The resulting concatenated matrix for the mixed test signal is passed through the saved model to generate two enhanced concatenated matrices for each source. These matrices are then transformed back into time-domain signals using the inverse STFT after extracting the magnitude and phase. The proposed approach has been evaluated on the GRID audio-visual corpus, with objective measurement metrics showing improved quality and intelligibility compared to existing methods. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
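
The signal path described in the abstract (STFT, concatenation of real and imaginary parts into one matrix, and reconstruction through magnitude and phase followed by the inverse STFT) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the window type, FFT size, hop length, and the choice to stack the two parts along the frequency axis are assumptions, and `placeholder_model` is a hypothetical stand-in for the trained U-NET, which the record does not specify.

```python
import numpy as np

N_FFT = 256  # assumed frame length; not stated in the abstract
HOP = 128    # assumed hop length (50% overlap)


def stft(x, n_fft=N_FFT, hop=HOP):
    """Hann-windowed STFT; returns a complex array (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)


def istft(spec, n_fft=N_FFT, hop=HOP):
    """Inverse STFT by windowed overlap-add, normalised by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=0)
    n_frames = spec.shape[1]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        start = i * hop
        out[start:start + n_fft] += frames[:, i] * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


def to_real_imag(spec):
    """Concatenate real and imaginary parts into one real-valued matrix.

    Stacking along the frequency axis is an assumption of this sketch.
    """
    return np.concatenate([spec.real, spec.imag], axis=0)


def from_real_imag(mat):
    """Rebuild a complex spectrogram from a concatenated matrix via magnitude and phase."""
    half = mat.shape[0] // 2
    real, imag = mat[:half], mat[half:]
    magnitude = np.hypot(real, imag)
    phase = np.arctan2(imag, real)
    return magnitude * np.exp(1j * phase)


def placeholder_model(features):
    """Hypothetical stand-in for the trained U-NET: one output matrix per source."""
    return 0.5 * features, 0.5 * features


# Example run on a synthetic "mixture" (random noise, for illustration only).
rng = np.random.default_rng(0)
mixture = rng.standard_normal(4096)
features = to_real_imag(stft(mixture))
est_a, est_b = placeholder_model(features)
wave_a = istft(from_real_imag(est_a))
wave_b = istft(from_real_imag(est_b))
```

With these assumed settings the concatenated matrix has twice as many rows as the spectrogram has frequency bins, and the round trip recovers the interior of the input signal; the first sample is lost because the periodic Hann window is zero there.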