QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Author:	Li, Chang; Wang, Ruoyu; Liu, Lijuan; Du, Jun; Sun, Yixuan; Guo, Zilu; Zhang, Zhenrong; Jiang, Yuan
Publication year:	2024
Subject:	Computer Science - Sound; Computer Science - Artificial Intelligence; Electrical Engineering and Systems Science - Audio and Speech Processing
Document type:	Working Paper
Description:	In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering an innovative approach to synthesizing musical content from textual descriptions. Achieving high accuracy and diversity in this generation process requires extensive, high-quality data, including both high-fidelity audio waveforms and detailed text descriptions, which often constitute only a small portion of available datasets. In open-source datasets, issues such as low-quality music waveforms, mislabeling, weak labeling, and unlabeled data significantly hinder the development of music generation models. To address these challenges, we propose a novel paradigm for high-quality music generation that incorporates a quality-aware training strategy, enabling generative models to discern the quality of input music waveforms during training. Leveraging the unique properties of musical signals, we first adapted and implemented a masked diffusion transformer (MDT) model for the TTM task, demonstrating its distinct capacity for quality control and enhanced musicality. Additionally, we address the issue of low-quality captions in TTM with a caption refinement data processing approach. Experiments demonstrate our state-of-the-art (SOTA) performance on MusicCaps and the Song-Describer Dataset. Our demo page can be accessed at https://qa-mdt.github.io/.
Database:	arXiv
External link:	http://arxiv.org/abs/2405.15863