Cacophony: An Improved Contrastive Audio-Text Model

Author:	Zhu, Ge; Darefsky, Jordan; Duan, Zhiyao
Publication year:	2024
Subject:	Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
Document type:	Working Paper
Description:	Despite recent advancements, audio-text models still lag behind their image-text counterparts in scale and performance. In this paper, we propose to improve both the data scale and the training procedure of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then train a contrastive model with an auxiliary captioning objective, with the audio encoder initialized from the MAE model. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification. Comment: Accepted at IEEE/ACM Transactions on Audio, Speech, and Language Processing
Database:	arXiv
External link:	http://arxiv.org/abs/2402.06986