Showing 1 - 10 of 1,995 for search: '"ZHAO, Zhou"'
Motion-to-music and music-to-motion have been studied separately, each attracting substantial research interest within their respective domains. The interaction between human motion and music is a reflection of advanced human intelligence, and establ
External link:
http://arxiv.org/abs/2411.01805
Multimodal learning has developed very fast in recent years. However, during the multimodal training process, the model tends to rely on only one modality based on which it could learn faster, thus leading to inadequate use of other modalities. Exist
External link:
http://arxiv.org/abs/2411.01409
Authors:
Cheng, Xize, Zheng, Siqi, Wang, Zehan, Fang, Minghui, Zhang, Ziang, Huang, Rongjie, Ma, Ziyang, Ji, Shengpeng, Zuo, Jialong, Jin, Tao, Zhao, Zhou
The scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse inter
External link:
http://arxiv.org/abs/2410.21269
Authors:
Xiao, Wenyi, Wang, Zechuan, Gan, Leilei, Zhao, Shuai, He, Wanggui, Tuan, Luu Anh, Chen, Long, Jiang, Hao, Zhao, Zhou, Wu, Fei
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free a
External link:
http://arxiv.org/abs/2410.15595
Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods utilizing cons
External link:
http://arxiv.org/abs/2410.12266
Generating music that aligns with the visual content of a video has been a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narratives
External link:
http://arxiv.org/abs/2410.12957
Authors:
Ye, Zhenhui, Zhong, Tianyun, Ren, Yi, Jiang, Ziyue, Huang, Jiawei, Huang, Rongjie, Liu, Jinglin, He, Jinzheng, Zhang, Chen, Wang, Zehan, Chen, Xize, Yin, Xiang, Zhao, Zhou
Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. Personalized TFG is a variant that emphasizes the perceptual identity similarity of the synthesized result (from the perspective of appearance
External link:
http://arxiv.org/abs/2410.06734
Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation tasks with discrete audio token sequences. However, directly discretizing audio by neural audio codecs
External link:
http://arxiv.org/abs/2409.19283
Authors:
Zhang, Yu, Jiang, Ziyue, Li, Ruiqi, Pan, Changhao, He, Jinzheng, Huang, Rongjie, Wang, Chuxin, Zhao, Zhou
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text pr
External link:
http://arxiv.org/abs/2409.15977
Authors:
Zhang, Yu, Pan, Changhao, Guo, Wenxiang, Li, Ruiqi, Zhu, Zhiyuan, Wang, Jialei, Xu, Wenhao, Lu, Jingyu, Hong, Zhiqing, Wang, Chuxin, Zhang, LiChao, He, Jinzheng, Jiang, Ziyue, Chen, Yuxin, Yang, Chen, Zhou, Jiecheng, Cheng, Xinyu, Zhao, Zhou
The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singer
External link:
http://arxiv.org/abs/2409.13832