YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Author: | Chen, Zihao; Zhang, Haomin; Di, Xinhan; Wang, Haoyu; Shan, Sizhe; Zheng, Junjie; Liang, Yunming; Fan, Yihan; Zhu, Xinfa; Tian, Wenjie; Wang, Yihua; Ding, Chaofan; Xie, Lei |
Publication Year: | 2024 |
Subject: | |
Document Type: | Working Paper |
Description: | Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, an industry-standard video-to-audio (V2A) dataset that encompasses various real-world scenarios is presented. We show that YingSound effectively generates high-quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: https://giantailab.github.io/yingsound/ Comment: 16 pages, 4 figures |
Database: | arXiv |
External Link: |
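The description above mentions a conditional flow matching transformer with a learnable audio-visual aggregator (AVA). The record does not specify the architecture, so the following is only a minimal PyTorch sketch of the general idea: audio latents attend to per-frame visual features via cross-attention, and a velocity network is trained with a conditional flow matching objective. All class names, shapes, and hyperparameters here are illustrative assumptions, not YingSound's actual design.

```python
# Minimal sketch (assumed design, not the paper's implementation):
# cross-attention audio-visual aggregation + conditional flow matching loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualAggregator(nn.Module):
    """Cross-attention block: audio latents attend to per-frame visual features."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, T_a, D), visual_tokens: (B, T_v, D)
        attended, _ = self.attn(audio_tokens, visual_tokens, visual_tokens)
        x = self.norm(audio_tokens + attended)
        return x + self.ff(x)


class VelocityNet(nn.Module):
    """Predicts the flow-matching velocity for audio latents, conditioned on video."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.ava = AudioVisualAggregator(dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, t, visual_tokens):
        # x_t: noisy audio latents (B, T_a, D); t: (B,) in [0, 1]
        t_emb = self.time_mlp(t[:, None])              # (B, D)
        h = self.ava(x_t + t_emb[:, None, :], visual_tokens)
        return self.out(h)                             # predicted velocity (B, T_a, D)


def flow_matching_loss(model, audio_latents, visual_tokens):
    """Conditional flow matching: regress the velocity of a straight noise->data path."""
    noise = torch.randn_like(audio_latents)
    t = torch.rand(audio_latents.size(0), device=audio_latents.device)
    # Linear interpolation between noise (t=0) and data (t=1).
    x_t = (1 - t[:, None, None]) * noise + t[:, None, None] * audio_latents
    target_velocity = audio_latents - noise
    pred_velocity = model(x_t, t, visual_tokens)
    return F.mse_loss(pred_velocity, target_velocity)


if __name__ == "__main__":
    model = VelocityNet(dim=256)
    audio = torch.randn(2, 100, 256)    # dummy audio latents
    video = torch.randn(2, 32, 256)     # dummy per-frame visual features
    loss = flow_matching_loss(model, audio, video)
    loss.backward()
    print(f"flow matching loss: {loss.item():.4f}")
```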