Showing 1 - 10 of 25 for the search: '"Xin, Detai"'
Author:
Xin, Detai, Tan, Xu, Shen, Kai, Ju, Zeqian, Yang, Dongchao, Wang, Yuancheng, Takamichi, Shinnosuke, Saruwatari, Hiroshi, Liu, Shujie, Li, Jinyu, Zhao, Sheng
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as …
External link:
http://arxiv.org/abs/2404.03204
Author:
Watanabe, Aya, Takamichi, Shinnosuke, Saito, Yuki, Nakata, Wataru, Xin, Detai, Saruwatari, Hiroshi
In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics …
External link:
http://arxiv.org/abs/2403.13353
Author:
Ju, Zeqian, Wang, Yuancheng, Shen, Kai, Tan, Xu, Xin, Detai, Yang, Dongchao, Liu, Yanqing, Leng, Yichong, Song, Kaitao, Tang, Siliang, Wu, Zhizheng, Qin, Tao, Li, Xiang-Yang, Ye, Wei, Zhang, Shikun, Bian, Jiang, He, Lei, Li, Jinyu, Zhao, Sheng
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, …
External link:
http://arxiv.org/abs/2403.03100
Author:
Xin, Detai, Jiang, Junfeng, Takamichi, Shinnosuke, Saito, Yuki, Aizawa, Akiko, Saruwatari, Hiroshi
We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also non…
External link:
http://arxiv.org/abs/2310.06072
Author:
Watanabe, Aya, Takamichi, Shinnosuke, Saito, Yuki, Nakata, Wataru, Xin, Detai, Saruwatari, Hiroshi
In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive …
External link:
http://arxiv.org/abs/2309.13509
Author:
Park, Joonyong, Takamichi, Shinnosuke, Nakamura, Tomohiko, Seki, Kentaro, Xin, Detai, Saruwatari, Hiroshi
We examine the speech modeling potential of generative spoken language modeling (GSLM), which involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis. Since GSLM facilitates textless spoken language pro…
External link:
http://arxiv.org/abs/2306.00697
We present JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora lack phrase or emotion diversity, which makes it difficult to analyze NVs and …
External link:
http://arxiv.org/abs/2305.12445
We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis lacks not only data but also proper ways to represent laughter. To solve these problems, we first propose an in-the-wild…
External link:
http://arxiv.org/abs/2305.12442
Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phras…
External link:
http://arxiv.org/abs/2302.13652
Author:
Xin, Detai, Adavanne, Sharath, Ang, Federico, Kulkarni, Ashish, Takamichi, Shinnosuke, Saruwatari, Hiroshi
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information of preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work either uses …
External link:
http://arxiv.org/abs/2211.02336