Showing 1 - 10 of 25 for the search: '"Xin, Detai"'
Author:
Xin, Detai, Tan, Xu, Shen, Kai, Ju, Zeqian, Yang, Dongchao, Wang, Yuancheng, Takamichi, Shinnosuke, Saruwatari, Hiroshi, Liu, Shujie, Li, Jinyu, Zhao, Sheng
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as …
External link:
http://arxiv.org/abs/2404.03204
Author:
Watanabe, Aya, Takamichi, Shinnosuke, Saito, Yuki, Nakata, Wataru, Xin, Detai, Saruwatari, Hiroshi
In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics …
External link:
http://arxiv.org/abs/2403.13353
Author:
Ju, Zeqian, Wang, Yuancheng, Shen, Kai, Tan, Xu, Xin, Detai, Yang, Dongchao, Liu, Yanqing, Leng, Yichong, Song, Kaitao, Tang, Siliang, Wu, Zhizheng, Qin, Tao, Li, Xiang-Yang, Ye, Wei, Zhang, Shikun, Bian, Jiang, He, Lei, Li, Jinyu, Zhao, Sheng
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, …
External link:
http://arxiv.org/abs/2403.03100
Author:
Xin, Detai, Jiang, Junfeng, Takamichi, Shinnosuke, Saito, Yuki, Aizawa, Akiko, Saruwatari, Hiroshi
We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also non…
External link:
http://arxiv.org/abs/2310.06072
Author:
Watanabe, Aya, Takamichi, Shinnosuke, Saito, Yuki, Nakata, Wataru, Xin, Detai, Saruwatari, Hiroshi
In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive …
External link:
http://arxiv.org/abs/2309.13509
Author:
Park, Joonyong, Takamichi, Shinnosuke, Nakamura, Tomohiko, Seki, Kentaro, Xin, Detai, Saruwatari, Hiroshi
We examine the speech modeling potential of generative spoken language modeling (GSLM), which involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis. Since GSLM facilitates textless spoken language pro…
External link:
http://arxiv.org/abs/2306.00697
We present JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora lack phrase or emotion diversity, which makes it difficult to analyze NVs and …
External link:
http://arxiv.org/abs/2305.12445
We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis lacks not only data but also proper ways to represent laughter. To solve these problems, we first propose an in-the-wild…
External link:
http://arxiv.org/abs/2305.12442
Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phras…
External link:
http://arxiv.org/abs/2302.13652
Author:
Xin, Detai, Adavanne, Sharath, Ang, Federico, Kulkarni, Ashish, Takamichi, Shinnosuke, Saruwatari, Hiroshi
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information of preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work either uses …
External link:
http://arxiv.org/abs/2211.02336