Showing 1 - 10 of 140 for search: '"Takamichi, Shinnosuke"'
This paper introduces CocoNut-Humoresque, an open-source large-scale speech likability corpus that includes speech segments and their per-listener likability scores. Evaluating voice likability is essential to designing preferable voices for speech s…
External link:
http://arxiv.org/abs/2407.04270
Author:
Seki, Kentaro, Takamichi, Shinnosuke, Takamune, Norihiro, Saito, Yuki, Imamura, Kanami, Saruwatari, Hiroshi
This paper proposes a new task called spatial voice conversion, which aims to convert a target voice while preserving spatial information and non-target signals. Traditional voice conversion methods focus on single-channel waveforms, ignoring the ste…
External link:
http://arxiv.org/abs/2406.17722
Author:
Igarashi, Takuto, Saito, Yuki, Seki, Kentaro, Takamichi, Shinnosuke, Yamamoto, Ryuichi, Tachibana, Kentaro, Saruwatari, Hiroshi
We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning a noisy-to-clean VC process. Ho…
External link:
http://arxiv.org/abs/2406.07280
Author:
Saito, Yuki, Igarashi, Takuto, Seki, Kentaro, Takamichi, Shinnosuke, Yamamoto, Ryuichi, Tachibana, Kentaro, Saruwatari, Hiroshi
We present SRC4VC, a new corpus containing 11 hours of speech recorded on smartphones by 100 Japanese speakers. Although high-quality multi-speaker corpora can advance voice conversion (VC) technologies, they are not always suitable for testing VC wh…
External link:
http://arxiv.org/abs/2406.07254
Author:
Li, Xinjian, Takamichi, Shinnosuke, Saeki, Takaaki, Chen, William, Shiota, Sayaka, Watanabe, Shinji
In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube spe…
External link:
http://arxiv.org/abs/2406.00899
Author:
Xin, Detai, Tan, Xu, Shen, Kai, Ju, Zeqian, Yang, Dongchao, Wang, Yuancheng, Takamichi, Shinnosuke, Saruwatari, Hiroshi, Liu, Shujie, Li, Jinyu, Zhao, Sheng
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as…
External link:
http://arxiv.org/abs/2404.03204
Author:
Watanabe, Aya, Takamichi, Shinnosuke, Saito, Yuki, Nakata, Wataru, Xin, Detai, Saruwatari, Hiroshi
In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics…
External link:
http://arxiv.org/abs/2403.13353
While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes refe…
External link:
http://arxiv.org/abs/2401.16812
Author:
Xin, Detai, Jiang, Junfeng, Takamichi, Shinnosuke, Saito, Yuki, Aizawa, Akiko, Saruwatari, Hiroshi
We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also non…
External link:
http://arxiv.org/abs/2310.06072
Author:
Watanabe, Aya, Takamichi, Shinnosuke, Saito, Yuki, Nakata, Wataru, Xin, Detai, Saruwatari, Hiroshi
In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive…
External link:
http://arxiv.org/abs/2309.13509