Showing 1 - 10 of 36 for search: '"Sung, June Sig"'
Authors:
Ellinas, Nikolaos, Christidou, Myrsini, Vioni, Alexandra, Sung, June Sig, Chalamandaris, Aimilios, Tsiakoulis, Pirros, Mastorocostas, Paris
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features f
External link:
http://arxiv.org/abs/2211.16307
Authors:
Klapsas, Konstantinos, Nikitaras, Karolos, Ellinas, Nikolaos, Sung, June Sig, Hwang, Inchul, Raptis, Spyros, Chalamandaris, Aimilios, Tsiakoulis, Pirros
A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures at t
External link:
http://arxiv.org/abs/2211.01327
Authors:
Nikitaras, Karolos, Klapsas, Konstantinos, Ellinas, Nikolaos, Maniati, Georgia, Sung, June Sig, Hwang, Inchul, Raptis, Spyros, Chalamandaris, Aimilios, Tsiakoulis, Pirros
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitl
External link:
http://arxiv.org/abs/2211.00523
Authors:
Vioni, Alexandra, Maniati, Georgia, Ellinas, Nikolaos, Sung, June Sig, Hwang, Inchul, Chalamandaris, Aimilios, Tsiakoulis, Pirros
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained sel
External link:
http://arxiv.org/abs/2211.00342
Authors:
Ellinas, Nikolaos, Vamvoukakis, Georgios, Markopoulos, Konstantinos, Maniati, Georgia, Kakoulidis, Panos, Sung, June Sig, Hwang, Inchul, Raptis, Spyros, Chalamandaris, Aimilios, Tsiakoulis, Pirros
This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture,
External link:
http://arxiv.org/abs/2210.17264
Authors:
Nikitaras, Karolos, Vamvoukakis, Georgios, Ellinas, Nikolaos, Klapsas, Konstantinos, Markopoulos, Konstantinos, Raptis, Spyros, Sung, June Sig, Jho, Gunu, Chalamandaris, Aimilios, Tsiakoulis, Pirros
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary
External link:
http://arxiv.org/abs/2204.05070
Authors:
Kakoulidis, Panos, Ellinas, Nikolaos, Vamvoukakis, Georgios, Markopoulos, Konstantinos, Sung, June Sig, Jho, Gunu, Tsiakoulis, Pirros, Chalamandaris, Aimilios
Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-b
External link:
http://arxiv.org/abs/2204.04127
Authors:
Klapsas, Konstantinos, Ellinas, Nikolaos, Nikitaras, Karolos, Vamvoukakis, Georgios, Kakoulidis, Panos, Markopoulos, Konstantinos, Raptis, Spyros, Sung, June Sig, Jho, Gunu, Chalamandaris, Aimilios, Tsiakoulis, Pirros
Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework
External link:
http://arxiv.org/abs/2204.03421
Authors:
Maniati, Georgia, Vioni, Alexandra, Ellinas, Nikolaos, Nikitaras, Karolos, Klapsas, Konstantinos, Sung, June Sig, Jho, Gunu, Chalamandaris, Aimilios, Tsiakoulis, Pirros
In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of mo
External link:
http://arxiv.org/abs/2204.03040
Authors:
Park, Sangjun, Choo, Kihyun, Lee, Joohyung, Porov, Anton V., Osipov, Konstantin, Sung, June Sig
Text-to-Speech (TTS) services that run on edge devices have many advantages compared to cloud TTS, e.g., latency and privacy issues. However, neural vocoders with a low complexity and small model footprint inevitably generate annoying sounds. This st
External link:
http://arxiv.org/abs/2203.14416