Showing 1 - 10 of 220 results for search: '"Yasuda, Yusuke"'
Author:
Yasuda, Yusuke, Toda, Tomoki
A preference-based subjective evaluation is a key method for evaluating generative media reliably. However, its huge combinations of pairs prohibit it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose
External link:
http://arxiv.org/abs/2403.06100
One objective of Speech Quality Assessment (SQA) is to estimate the ranks of synthetic speech systems. However, recent SQA models are typically trained using low-precision direct scores such as mean opinion scores (MOS) as the training objective, whi
External link:
http://arxiv.org/abs/2308.15203
Author:
Yasuda, Yusuke, Toda, Tomoki
Text-to-speech synthesis (TTS) is a task to convert texts into speech. Two of the factors that have been driving TTS are the advancements of probabilistic models and latent representation learning. We propose a TTS method based on latent variable con
External link:
http://arxiv.org/abs/2212.08329
Author:
Yasuda, Yusuke, Toda, Tomoki
Published in:
IEEE Journal of Selected Topics in Signal Processing (Volume: 16, Issue: 6, October 2022)
End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering the correct pitch accents is still a challenging problem for end-to-end TTS. To tackle the challenge of rendering correct pitch a
External link:
http://arxiv.org/abs/2212.08321
Author:
Hayashi, Tomoki, Yamamoto, Ryuichi, Yoshimura, Takenori, Wu, Peter, Shi, Jiatong, Saeki, Takaaki, Ju, Yooncheol, Yasuda, Yusuke, Takamichi, Shinnosuke, Watanabe, Shinji
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, an
External link:
http://arxiv.org/abs/2110.07840
We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic config
External link:
http://arxiv.org/abs/2011.04839
We have been working on speech synthesis for rakugo (a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy) toward speech synthesis that authentically entertains audiences. In this paper, we propose a novel evaluat
External link:
http://arxiv.org/abs/2010.11549
Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework using explicit duration modeling that incorporates duration as a discrete latent variable to TTS and ena
External link:
http://arxiv.org/abs/2010.09602
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS does not require manual
External link:
http://arxiv.org/abs/2005.10390
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity. We investigate an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and
External link:
http://arxiv.org/abs/2005.01245