High-Resolution Representation Learning and Recurrent Neural Network for Singing Voice Separation

Authors:	Bhuwan Bhattarai, Yagya Raj Pandeya, You Jie, Arjun Kumar Lamichhane, Joonwhoan Lee
Publication year:	2022
Subject:	Applied Mathematics; Signal Processing
Source:	Circuits, Systems, and Signal Processing. 42:1083-1104
ISSN:	1531-5878 0278-081X
DOI:	10.1007/s00034-022-02166-5
Description:	Music source separation has traditionally followed the encoder-decoder paradigm (e.g., hourglass, U-Net, DeconvNet, SegNet) to isolate individual music components from mixtures. Such networks, however, lose location sensitivity, because their low-resolution representations drop useful harmonic patterns along the temporal dimension. We overcame this problem by performing singing voice separation with a high-resolution representation learning network (HRNet) coupled with a long short-term memory (LSTM) module, which retains high-resolution feature maps and captures the temporal behavior of the acoustic signal. We refer to this combination of HRNet and LSTM as HR-LSTM. The spectrograms predicted by this system are close to the ground truth and successfully separate music sources, achieving results superior to those of past methods. The proposed network was tested on four datasets (DSD100, MIR-1K, Korean Pansori, and Nepal Idol singing voice [NISVS]). Our experiments confirmed that the proposed HR-LSTM outperforms state-of-the-art networks at singing voice separation on the DSD100 dataset, performs comparably to alternative methods on the MIR-1K dataset, and separates the voice and accompaniment components well on the Pansori and NISVS datasets. In addition to proposing and validating our network, we also developed and shared our Nepal Idol dataset.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::676eaa9339930f7f8a23a0e61adcb5b9 https://doi.org/10.1007/s00034-022-02166-5