Author: |
Yoshihiko Nankaku, Keiichiro Oura, Yukiya Hono, Kei Sawada, Keiichi Tokuda, Kei Hashimoto, Koki Senda |
Year of publication: |
2018 |
Subject: |
|
Source: |
APSIPA |
DOI: |
10.23919/apsipa.2018.8659568 |
Description: |
This paper proposes a method for selecting training data for many-to-one singing voice conversion (VC) from waveform data on the social media music app “nana.” On this app, users can share sounds such as speech, singing, and instrumental music recorded on their smartphones. More than one million hours of waveform data have accumulated, so the collection can be regarded as “big data.” Big data of this kind is widely expected to yield substantial value when combined with advanced deep learning technology. nana's database contains many posts in which multiple users sing the same song. Such data is well suited to training VC systems, because VC frameworks based on statistical approaches often require parallel data sets, that is, pairs of waveform data in which source and target singers sing the same phrases. The proposed method composes parallel data sets for many-to-one statistical VC from nana's database by extracting frames with small differences in utterance timing, based on the results of dynamic programming (DP) matching. Experimental results indicate that a system trained on data composed by this method converts acoustic features more accurately than a system trained without it. |
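The frame selection described in the abstract can be illustrated with a minimal sketch. It assumes classic dynamic time warping over per-frame feature vectors and a simple rule that keeps only aligned frame pairs whose indices differ by at most a few frames. The abstract does not state the paper's exact features, distance, or selection criterion, so the function names `dtw_path` and `select_frames` and the `max_shift` threshold are hypothetical and serve only as an illustration.

```python
import numpy as np


def dtw_path(src, tgt):
    """Align two feature sequences (frames x dims) with classic DTW.

    Returns the warping path as a list of (source_frame, target_frame) pairs.
    """
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distance between every source and target frame.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)

    # cost[i, j] = minimal accumulated distance aligning src[:i] with tgt[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )

    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def select_frames(path, max_shift):
    """Keep aligned pairs whose timing difference is at most max_shift frames.

    max_shift is a hypothetical threshold, not a value taken from the paper.
    """
    return [(i, j) for i, j in path if abs(i - j) <= max_shift]


# Toy one-dimensional "features": the target singer starts two frames late.
source = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
target = np.array([[0.0], [0.0], [0.0], [1.0], [2.0], [3.0]])

path = dtw_path(source, target)
selected = select_frames(path, max_shift=1)
```

Frame pairs that survive the filter would form the parallel training data; pairs with a larger timing gap are discarded as likely misalignments.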
Database: |
OpenAIRE |
External link: |
|