Showing 1 - 10 of 99 results for the search: '"Gao, Yingming"'
To address the limitation in multimodal emotion recognition (MER) performance arising from inter-modal information fusion, we propose a novel MER framework based on multitask learning where fusion occurs after alignment, called Foal-Net. The framework …
External link: http://arxiv.org/abs/2408.09438
Author: Fu, Ruibo; Liu, Rui; Qiang, Chunyu; Gao, Yingming; Lu, Yi; Shi, Shuchen; Wang, Tao; Li, Ya; Wen, Zhengqi; Zhang, Chen; Bu, Hui; Liu, Yukun; Qi, Xin; Li, Guanjun
The Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC 2024) is part of the ISCSLP 2024 Competitions and Challenges track. While current text-to-speech (TTS) technology can generate high-quality audio, its ability to convey complex emotions …
External link: http://arxiv.org/abs/2407.12038
Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains …
External link: http://arxiv.org/abs/2406.05692
Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection …
External link: http://arxiv.org/abs/2406.03714
Recent advances in large language models (LLMs) and the development of audio codecs have greatly propelled zero-shot TTS. These systems can synthesize personalized speech from only a 3-second sample of an unseen speaker as an acoustic prompt. However, they only support …
External link: http://arxiv.org/abs/2406.03706
Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing …
External link: http://arxiv.org/abs/2401.01044
Speech emotion recognition (SER) systems aim to recognize human emotional states during human-computer interaction. Most existing SER systems are trained based on utterance-level labels. However, not all frames in an audio clip have affective states consistent …
External link: http://arxiv.org/abs/2312.16383
Author: Deng, Yayue; Xue, Jinlong; Jia, Yukang; Li, Qifei; Han, Yichen; Wang, Fengping; Gao, Yingming; Ke, Dengfeng; Li, Ya
Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. While previous methods have already delved into enhancing context comprehension …
External link: http://arxiv.org/abs/2312.10358
People have long hoped for a conversational system that can assist in real-life situations, and recent progress in large language models (LLMs) is bringing this idea closer to reality. While LLMs are often impressive in performance, their efficacy in …
External link: http://arxiv.org/abs/2308.14536
Author: Xue, Jinlong; Deng, Yayue; Wang, Fengping; Li, Ya; Gao, Yingming; Tao, Jianhua; Sun, Jianqing; Liang, Jiaen
Conversational text-to-speech (TTS) aims to synthesize a reply with prosody appropriate to the historical conversation. However, comprehensively modeling the conversation remains a challenge, and a majority of conversational TTS systems …
External link: http://arxiv.org/abs/2305.02269