Showing 1 - 10 of 29 for search: '"Joly, Arnaud"'
Author:
Łajszczak, Mateusz, Cámbara, Guillermo, Li, Yang, Beyhan, Fatih, van Korlaar, Arent, Yang, Fan, Joly, Arnaud, Martín-Cortinas, Álvaro, Abbas, Ammar, Michalski, Adam, Moinet, Alexis, Karlapati, Sri, Muszyńska, Ewa, Guo, Haohan, Putrycz, Bartosz, Gambino, Soledad López, Yoo, Kayeon, Sokolova, Elena, Drugman, Thomas
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public do
External link:
http://arxiv.org/abs/2402.08093
Author:
Joly, Arnaud, Nicolis, Marco, Peterova, Ekaterina, Lombardi, Alessandro, Abbas, Ammar, van Korlaar, Arent, Hussain, Aman, Sharma, Parul, Moinet, Alexis, Lajszczak, Mateusz, Karanasou, Penny, Bonafonte, Antonio, Drugman, Thomas, Sokolova, Elena
We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consi
External link:
http://arxiv.org/abs/2307.07062
Author:
Makarov, Peter, Abbas, Ammar, Łajszczak, Mateusz, Joly, Arnaud, Karlapati, Sri, Moinet, Alexis, Drugman, Thomas, Karanasou, Penny
Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based F
External link:
http://arxiv.org/abs/2206.14643
Author:
Lajszczak, Mateusz, Prasad, Animesh, van Korlaar, Arent, Bollepalli, Bajibabu, Bonafonte, Antonio, Joly, Arnaud, Nicolis, Marco, Moinet, Alexis, Drugman, Thomas, Wood, Trevor, Sokolova, Elena
This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available dur
External link:
http://arxiv.org/abs/2202.06409
Author:
Abbas, Ammar, Bollepalli, Bajibabu, Moinet, Alexis, Joly, Arnaud, Karanasou, Penny, Makarov, Peter, Slangens, Simon, Karlapati, Sri, Drugman, Thomas
We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale
External link:
http://arxiv.org/abs/2106.15649
Author:
Karanasou, Penny, Karlapati, Sri, Moinet, Alexis, Joly, Arnaud, Abbas, Ammar, Slangen, Simon, Lorenzo-Trueba, Jaime, Drugman, Thomas
Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of proso
External link:
http://arxiv.org/abs/2106.10229
Author:
Karlapati, Sri, Abbas, Ammar, Hodari, Zack, Moinet, Alexis, Joly, Arnaud, Karanasou, Penny, Drugman, Thomas
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms a
External link:
http://arxiv.org/abs/2011.02252
Author:
Hodari, Zack, Moinet, Alexis, Karlapati, Sri, Lorenzo-Trueba, Jaime, Merritt, Thomas, Joly, Arnaud, Abbas, Ammar, Karanasou, Penny, Drugman, Thomas
Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic s
External link:
http://arxiv.org/abs/2011.01175
Author:
Karlapati, Sri, Moinet, Alexis, Joly, Arnaud, Klimkov, Viacheslav, Sáez-Trigueros, Daniel, Drugman, Thomas
Published in:
INTERSPEECH 2020: 4387-4391
Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source au
External link:
http://arxiv.org/abs/2004.14617
In many applications of supervised learning, multiple classification or regression outputs have to be predicted jointly. We consider several extensions of gradient boosting to address such problems. We first propose a straightforward adaptation of gr
External link:
http://arxiv.org/abs/1905.07558