Search results

Report

MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence

Authors: You, Fuming; Fang, Minghui; Tang, Li; Huang, Rongjie; Wang, Yongqi; Zhao, Zhou

Motion-to-music and music-to-motion have been studied separately, each attracting substantial research interest within their respective domains. The interaction between human motion and music is a reflection of advanced human intelligence, and establ

External link: http://arxiv.org/abs/2411.01805

Report

Classifier-guided Gradient Modulation for Enhanced Multimodal Learning

Authors: Guo, Zirun; Jin, Tao; Chen, Jingyuan; Zhao, Zhou

Multimodal learning has developed very fast in recent years. However, during the multimodal training process, the model tends to rely on only one modality based on which it could learn faster, thus leading to inadequate use of other modalities. Exist

External link: http://arxiv.org/abs/2411.01409

Report

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Authors: Cheng, Xize; Zheng, Siqi; Wang, Zehan; Fang, Minghui; Zhang, Ziang; Huang, Rongjie; Ma, Ziyang; Ji, Shengpeng; Zuo, Jialong; Jin, Tao; Zhao, Zhou

The scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse inter

External link: http://arxiv.org/abs/2410.21269

Report

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Authors: Xiao, Wenyi; Wang, Zechuan; Gan, Leilei; Zhao, Shuai; He, Wanggui; Tuan, Luu Anh; Chen, Long; Jiang, Hao; Zhao, Zhou; Wu, Fei

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free a

External link: http://arxiv.org/abs/2410.15595

Report

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

Authors: Liu, Huadai; Wang, Jialei; Huang, Rongjie; Liu, Yang; Lu, Heng; Xue, Wei; Zhao, Zhou

Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods utilizing cons

External link: http://arxiv.org/abs/2410.12266

Report

MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Authors: Li, Ruiqi; Zheng, Siqi; Cheng, Xize; Zhang, Ziang; Ji, Shengpeng; Zhao, Zhou

Generating music that aligns with the visual content of a video has been a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narratives

External link: http://arxiv.org/abs/2410.12957

Report

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes

Authors: Ye, Zhenhui; Zhong, Tianyun; Ren, Yi; Jiang, Ziyue; Huang, Jiawei; Huang, Rongjie; Liu, Jinglin; He, Jinzheng; Zhang, Chen; Wang, Zehan; Chen, Xize; Yin, Xiang; Zhao, Zhou

Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. Personalized TFG is a variant that emphasizes the perceptual identity similarity of the synthesized result (from the perspective of appearance

External link: http://arxiv.org/abs/2410.06734

Report

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

Authors: Liu, Wenrui; Guo, Zhifang; Xu, Jin; Lv, Yuanjun; Chu, Yunfei; Zhao, Zhou; Lin, Junyang

Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation tasks with discrete audio token sequences. However, directly discretizing audio by neural audio codecs

External link: http://arxiv.org/abs/2409.19283

Report

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

Authors: Zhang, Yu; Jiang, Ziyue; Li, Ruiqi; Pan, Changhao; He, Jinzheng; Huang, Rongjie; Wang, Chuxin; Zhao, Zhou

Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text pr

External link: http://arxiv.org/abs/2409.15977
