MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model

Author: Wang, Sen; Zhang, Jiangning; Cao, Weijian; Hu, Xiaobin; Li, Moran; Ji, Xiaozhong; Tan, Xin; Li, Mengtian; Xie, Zhifeng; Wang, Chengjie; Ma, Lizhuang
Publication Year: 2024
Subject:
Document Type: Working Paper
Description: The body movements accompanying speech help speakers express their ideas. Co-speech motion generation is an important approach to synthesizing realistic avatars. Due to the intricate correspondence between speech and motion, generating realistic and diverse motion is a challenging task. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion generation framework based on the diffusion model that ensures both the authenticity and diversity of generated motion. We propose a progressive fusion strategy to enhance inter-modal and intra-modal interaction and efficiently integrate multi-modal information. Specifically, we employ a masked style matrix based on emotion and identity information to control the generation of different motion styles. Temporal modeling of speech and motion is partitioned into style-guided specific feature encoding and shared feature encoding, aiming to learn both inter-modal and intra-modal features. In addition, we propose a geometric loss that enforces coherence of joint velocity and acceleration across frames. Our framework generates vivid, diverse, and style-controllable motion of arbitrary length from input speech, with editable identity and emotion. Extensive experiments demonstrate that our method outperforms current co-speech motion generation methods on both the upper body and the more challenging full body.
Database: arXiv
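
The description above mentions a geometric loss on joint velocity and acceleration but gives no formulation. The following is a minimal sketch of such a loss, assuming motion is represented as per-frame 3D joint positions in a tensor of shape (batch, frames, joints, 3); the tensor layout, the L1 penalty, and the weights w_vel and w_acc are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn.functional as F


def geometric_loss(pred: torch.Tensor,
                   gt: torch.Tensor,
                   w_vel: float = 1.0,
                   w_acc: float = 1.0) -> torch.Tensor:
    """Penalize mismatch in joint velocity and acceleration across frames.

    pred, gt: (batch, frames, joints, 3) joint positions (assumed layout).
    """
    # First-order finite difference along the frame axis -> velocity.
    pred_vel = pred[:, 1:] - pred[:, :-1]
    gt_vel = gt[:, 1:] - gt[:, :-1]

    # Second-order finite difference -> acceleration.
    pred_acc = pred_vel[:, 1:] - pred_vel[:, :-1]
    gt_acc = gt_vel[:, 1:] - gt_vel[:, :-1]

    vel_term = F.l1_loss(pred_vel, gt_vel)
    acc_term = F.l1_loss(pred_acc, gt_acc)
    return w_vel * vel_term + w_acc * acc_term


if __name__ == "__main__":
    # Toy usage: 2 sequences, 30 frames, 55 joints, 3D positions.
    pred = torch.randn(2, 30, 55, 3, requires_grad=True)
    gt = torch.randn(2, 30, 55, 3)
    loss = geometric_loss(pred, gt)
    loss.backward()
    print(float(loss))
```

In practice a loss of this kind is added to the main reconstruction or diffusion objective with small weights, so that it discourages frame-to-frame jitter without dominating training.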